FOREWORD



In December 1969, a task force was organized for the purpose of advising on the 
scope and organization of a series of reports regarding ability grouping in the 
public schools of the United States. Those involved in the planning included: 



Warren G. Findley, Principal Investigator

Miriam M. Bryan
Paul I. Clifford
John E. Dobbin
Gordon Foster
Edmund W. Gordon
Roger T. Lennon
A. John Stauffer
Ralph V. Tyler



The Office of Education and the U. S. Department of Health, Education, and Welfare
were represented by Peter Briggs, Christopher Hagen, and Rosa D. Wiener.

Four documents were planned and have now been completed. 



I.    Common Practices in the Use of Tests for Grouping
      Students in Public Schools.

II.   The Impact of Ability Grouping on School Achievement,
      Affective Development, Ethnic Separation, and
      Socioeconomic Separation.

III.  Problems and Utilities Involved in the Use of Tests
      for Grouping Children with Limited Backgrounds, and
      Alternative Strategies to Such Grouping.

IV.   Conclusions and Recommendations





Mrs. Bryan prepared Document I, based on questionnaire responses from schoolmen
and supplementary data from Miss Wiener. Dr. Clifford and Mr. Dominick Esposito
prepared the basic content of Document II, which was then edited by Mrs. Bryan.
Contributions to Document III were secured from Mrs. Bryan, Mr. Dobbin, Dr. Findley,
Mrs. Blythe Mitchell, and Dr. Stauffer. The summary and conclusions were prepared
by Dr. Findley.

The work presented herein was performed pursuant to a grant from the U. S. Office
of Education, Bureau of Elementary and Secondary Education, Department of Health,
Education, and Welfare. However, the opinions expressed herein do not necessarily
reflect the position or policy of the U. S. Office of Education, and no official
endorsement by the U. S. Office of Education should be inferred.

Additional copies of the four documents are available upon request. Write:




Dr. Morrill M. Hall, Director 
Center for Educational Improvement 
College of Education 
University of Georgia 
Athens, Georgia 30601

INTRODUCTION



This document is presented in two parts: Part A is concerned with the 
problems and utilities involved in the use of tests for grouping children 
with limited backgrounds for purposes of instruction; Part B presents
descriptions of alternative strategies. The first part was provided for
in the outline originally set by the committee; the second part was added 
when the committee became impressed with the number of alternative strategies 
suggested in the literature as being effective and efficient. The strategies 
presented are merely representative of the variety of alternatives available. 
The reader may be able to add others. 

A. THE PROBLEMS AND UTILITIES INVOLVED IN THE USE OF
TESTS FOR GROUPING CHILDREN WITH LIMITED BACKGROUNDS



The search for useful information regarding the validity and reliability 
of standardized aptitude and achievement tests for use in grouping children 
with limited backgrounds for purposes of instruction has been an exhaustive 
but, unfortunately, not a very productive one. Not a single study, for 
example, among the more than two hundred located was found to involve all 
three aspects of the topic: test validity and reliability, culturally limited
populations, and homogeneous grouping. It has been necessary, therefore,
to attempt to go beyond the data presented and to make calculated inferences
as to what might be expected to occur under certain combinations of
circumstances.



Definition of Terms 



The definition of a few terms is in order here if the intent of this 
document is to be clearly understood. These definitions may be read first 
or in conjunction with the discussion that follows. They are presented in 
a sequence of importance for understanding the material of this document. 
Wherever a term used in a definition is not understood, its definition is 
to be found later on. 

1. In this document, concern will be for the validity not only of the
tests themselves but also of their use for the whole population. Are the
tests giving us the kind of information about students and about programs
of instruction that we really want to know? In particular, do the tests
provide comparable information about students with different backgrounds that
can be useful in conducting the instructional program? Note particularly the
definition of construct or pure validity given last.

The validity of a test refers to the extent to which a test does the
job for which it is intended. Validity has different connotations for various
kinds of tests and, accordingly, different kinds of validity are appropriate
for them. For example, the validity of an achievement test is the extent to
which the content of the test represents a balanced and adequate sampling of
the outcomes (knowledge, skills, etc.) of the course or instructional program
it is intended to cover (content, face, or curricular validity). The validity
of an aptitude or readiness test is the extent to which it accurately indicates
future learning success in the area for which it is used as a predictor
(predictive validity). The validity of a personality test is the extent to which
the test yields an accurate description of an individual's personality traits
or personality organization as of that moment (status or concurrent validity).

The validity of a test or of a procedure for the use of a test for a
particular purpose involves a combination of concurrent validity for indicating
the present status of individuals in mastering a subject, predictive validity
for indicating the probable later achievement of individuals in mastering that
subject under specified instructional procedures, and freedom from correlation
with extraneous variables on the part of the original or final measures of
achievement. This total requirement may be called construct or pure validity.
This concept of validity may be extended to other measures (self-concept ratings,
personality measures, etc.) by substituting these terms for achievement in this
definition.

2. The reliability of a test refers to the extent to which a test is
consistent in measuring whatever it does measure: dependability, stability,
relative freedom from errors of measurement. It is usually estimated by
some form of reliability coefficient or by the standard error of measurement.
The higher the reliability coefficient and the smaller the standard error of
measurement, the more reliable is the test.

Reliability coefficients take their names from the method of determination.
In this document we will be most frequently concerned with the
alternate form coefficient, which is generally obtained by giving two
parallel forms of a test (with equal content, means, and variances) to the
same group of individuals on closely succeeding days and correlating the
results; the split-half coefficient, which is obtained by correlating scores on
one-half of a test with scores on the other half; the Kuder-Richardson
coefficient, which is obtained from item statistics of a single administration of
one form of a test; and the test-retest coefficient, which is obtained by
administering the same test a second time after a short interval and correlating
the two sets of scores. The alternate form estimate is generally preferred
because it reflects the day-to-day variability implicit in ordinary use of
tests.
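
As an illustration only, two of the single-administration coefficients just named can be computed as in the following sketch. The response data are hypothetical, the odd-even split is one of several possible ways of halving a test, and the Spearman-Brown step-up to full test length is an addition not described in the definition above. Kuder-Richardson Formula 20 is computed here from item pass rates and the variance of total scores, with items scored 1 (right) or 0 (wrong).

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half(item_scores):
    """Correlate scores on the odd-numbered items with scores on the even-numbered items."""
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    return pearson_r(odd, even)


def spearman_brown(r_half):
    """Step a half-test correlation up to full test length (an added step, see lead-in)."""
    return 2 * r_half / (1 + r_half)


def kr20(item_scores):
    """Kuder-Richardson Formula 20 for items scored 0 or 1."""
    n = len(item_scores)          # number of pupils
    k = len(item_scores[0])       # number of items
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n   # proportion passing item j
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)


# Hypothetical responses: 6 pupils by 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
r_half = split_half(responses)
print(round(r_half, 3), round(spearman_brown(r_half), 3), round(kr20(responses), 3))
```

A test of four items is far too short for practical use; the sketch is meant only to show which quantities enter each coefficient.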



3. The standard error of measurement is an estimate of the magnitude of
the "error of measurement" in a score, that is, the amount by which an obtained
score differs from a hypothetical true score. It is the standard deviation of the
differences between actual scores and theoretical true scores of the same
individuals on a test. The standard error is an amount such that in about
two-thirds of the cases the obtained score would not differ from the true
score by more than one standard error.
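
The relation between definitions 2 and 3 can be made concrete with the customary estimate, standard error = standard deviation x sqrt(1 - reliability). The formula is the standard one from test theory rather than one stated in this document, and the standard deviation of 15 points and reliability of .91 used below are hypothetical values chosen for illustration.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Customary estimate: SD of obtained scores times sqrt(1 - reliability)."""
    return sd * sqrt(1.0 - reliability)


def one_error_band(obtained, sd, reliability):
    """Score range lying within one standard error of an obtained score."""
    sem = standard_error_of_measurement(sd, reliability)
    return obtained - sem, obtained + sem


# Hypothetical test: standard deviation 15 points, reliability coefficient .91.
sem = standard_error_of_measurement(15.0, 0.91)
low, high = one_error_band(100.0, 15.0, 0.91)
print(round(sem, 2), round(low, 1), round(high, 1))
```

Under these assumed values the standard error is about 4.5 points, so an obtained score of 100 would be read as lying, in about two cases out of three, within roughly 4.5 points of the true score.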

4. A standard deviation is a measure of the variability or dispersion
of a set of scores. The more the scores cluster around the mean, the smaller
the standard deviation. It is the "root-mean-square deviation" originated by
astronomers.

5. Correlation is the degree of agreement between two sets of data. In this
document, the data will usually be scores on two tests for the same individuals,
or scores on one test and marks given to the same individuals by a teacher. Less
often they will be correlations between scores on other measures (interest
inventories, personality scales, self-concept ratings) and test scores or marks.

Correlation is expressed in terms of a correlation coefficient, generally
designated by the symbol r. This is an abstract number that can take on values
between 0 and 1.00 in absolute size. The value of 1.00, almost never found, shows
perfect agreement in the rank order of scores on one variable and scores on a
second variable. The value 0, as that figure implies, shows absence of relationship
between two sets of scores or random association between the sets. When the
coefficient is preceded by a plus sign (+) or is presented without a sign preceding
it, the correlation is said to be positive, with high scores on the first variable
being most often associated with high scores on the second variable and low scores
on the two variables also being associated with each other. When the coefficient is
preceded by a minus sign (-), the correlation is said to be negative: high scores on
the first variable are most often associated with low scores on the second variable,
and vice versa. This occurs less frequently, as one might expect.
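
A brief sketch may help fix the meaning of the sign and size of r. It computes the coefficient as the mean product of paired standard scores, one common formulation of the product-moment correlation; the pupils' test scores, teacher marks, and days absent are invented for the illustration.

```python
from math import sqrt


def standard_scores(values):
    """Express each value as a deviation from the mean in standard deviation units."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def correlation(x, y):
    """Correlation coefficient r as the mean product of paired standard scores."""
    zx, zy = standard_scores(x), standard_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


# Hypothetical data for five pupils.
test_scores = [52, 61, 47, 70, 58]
teacher_marks = [74, 80, 70, 91, 77]
days_absent = [9, 5, 12, 1, 6]

r_positive = correlation(test_scores, teacher_marks)  # high goes with high
r_negative = correlation(test_scores, days_absent)    # high goes with low
print(round(r_positive, 2), round(r_negative, 2))
```

With these made-up figures the first coefficient is positive and the second negative, which is all the example is meant to show; no real relationship between attendance and test scores is implied.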



6. Multiple correlation is the degree of agreement between one variable, the
criterion, and the best-weighted combination of a set of two or more other
variables. An example would be the correlation between two test scores obtained
at the beginning of a period of instruction (say, an achievement test score and
an intelligence test score) and another test score at the end of instruction,
generally an achievement test score in the same subject. A common example from
outside the scope of this document would be the multiple correlation between
high school average and entrance test scores used as predictors and grade point
average at the end of the freshman year in college. Multiple correlation is
expressed in terms of a coefficient of multiple correlation, designated by the
symbol R to distinguish it from r, the symbol for simple correlation between
two variables. This coefficient also takes on values between 0 and 1.00. When
compared with the simple correlation between each of the predictor variables
separately and the criterion, it shows the improvement in efficiency of
prediction achieved by using the several variables in combination to predict the
criterion. Multiple correlation R is always expressed without a sign because
it can be used only to express the strength of a relationship.

7. A regression equation is an equation for predicting a criterion measure
from the information provided by a single predictor or a set of two or more
predictors. If a single predictor is used, we speak of simple regression or a
simple regression equation; if two or more predictors are used, we speak of
multiple regression, or a multiple regression equation. Correlation as described
in definitions 5 and 6 preceding is the basis for determining the
coefficients to be used in the equation.
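
How correlations determine the coefficients of a regression equation, and how R compares with the separate simple correlations, may be sketched as follows. The sketch uses the standard textbook formulas for one predictor and for two predictors in standard-score form; the correlations, means, and standard deviations are hypothetical, and the function names are introduced here only for illustration.

```python
from math import sqrt


def simple_regression(r, mean_x, sd_x, mean_y, sd_y):
    """Slope and intercept for predicting y from a single predictor x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def two_predictor_weights(r_y1, r_y2, r_12):
    """Standard-score regression weights for two predictors of one criterion."""
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2


def multiple_r(r_y1, r_y2, r_12):
    """Coefficient of multiple correlation R for two predictors."""
    beta1, beta2 = two_predictor_weights(r_y1, r_y2, r_12)
    return sqrt(beta1 * r_y1 + beta2 * r_y2)


# Hypothetical values: an initial achievement score correlates .60 with final
# achievement, an intelligence score correlates .50 with it, and the two
# predictors correlate .40 with each other.
slope, intercept = simple_regression(0.60, 50.0, 10.0, 60.0, 12.0)
predicted = intercept + slope * 60.0   # predicted criterion score for x = 60
big_r = multiple_r(0.60, 0.50, 0.40)
print(round(slope, 2), round(intercept, 1), round(predicted, 1), round(big_r, 3))
```

In this made-up case R comes to about .66, somewhat above the better single predictor's .60, which is the "improvement in efficiency of prediction" referred to in definition 6.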

Cultural Bias in Tests



The concept of cultural bias is receiving new attention. In the late 1940's
and early 1950's much professional effort was devoted to analyzing tests with a
view to producing "culture-free" or "culture-fair" tests (Machover, 1943; Turnbull,
1949; Davis, et al., 1951). Continuing efforts have been made by Cattell (1963)
in his distinction between "crystallized" and "fluid" intelligence. Lorge (1952)
pronounced a definitive evaluation of such efforts generally by pointing out that
the major source of bias is to be found in society's "demands" and that tests must
be related to those biases to define the cultural handicap of the disadvantaged in
meeting the demands so that efforts may be directed toward correcting disadvantage
and measuring progress in correcting it in individuals.

Two recent reviews, by Lambert (1964) and Anastasi (1964), merit mention as
references here. Lambert summarizes information about a great variety of measures
of aptitude and achievement designed to be "culture-fair" and includes much obtained 
from direct correspondence or conversation with interested researchers. Anastasi 
clarifies the relations among a number of the measures and particularly the concept 
of culture-fairness as that varies with different groups studied and purposes 
served. For example, she points out that 

It is commonly assumed that nonverbal tests are more nearly 
culture-fair than are verbal tests. This assumption is 
obviously correct for persons who speak different languages. 

But for groups speaking a common language, whose cultures 
differ in other important respects, verbal tests may be less 
culturally loaded than tests of a predominantly spatial or 
perceptual nature. 

Anastasi also points to factors that may normally be considered to limit the
"culture fairness" of a test, but have validity in a particular situation.
Thus

. . . the same factor that lowered the test score would also
handicap the individual in his educational and vocational
progress and in many other activities of daily life. Similarly,
slow work habits, emotional insecurity, low achievement
drive, lack of interest in abstract problems, and many other
culturally linked conditions affecting test scores are also
likely to influence the relatively broad area of criterion
behavior.

The reader should not be surprised, then, to find tests pronounced unbiased
simply because they reflect the attributes that predict further achievement in
school.



The view taken here separates society’s demands into two chief parts:
inescapable demands of living in an increasingly technological, urban, somewhat closed
culture, and demands enforced by cultural distinctions of observable behavior largely
associated with speech and historical knowledge. A current cigarette advertisement
has capitalized on this by asking, "What do you want: good grammar or good taste?"
A common speech fault in English is use of the double negative, a "fault" generally
reenforced for the disadvantaged child by the constant pressure of his home and
neighborhood; yet in most modern foreign languages, the double negative is correct
usage to achieve emphasis. And American students have to learn to correct their
fault of forgetting to use the double negative!

Spelling is another mark of cultural bias. Among the readers of a publication
like this, or of any publication intended for general currency, unfavorable notice
would certainly be taken by many of faulty spelling if at all frequent. Yet it
is doubtful that the meaning would have been unclear, as witness the fact that others
will read by each error without noticing it. It may be noted that spelling enjoys
the status of a school subject only in English-speaking countries because English
is the only language not uniformly phonetic. Early emphasis on formal approaches
to correct spelling can intimidate an otherwise competent child from exercising
a free flow of writing for fear of misspelling. How much better a situation in
which a child writes to inform distant parents that he has an "earake," enabling
the family to swing into action immediately. "What do you want: good spelling
or good medicine?"

The effect of frequent correction for the "stigmata" of poor speech and poor 
spelling is subject to review and curricular revision if it is agreed that early 
overemphasis on correctness produces academic and affective deficiencies. Cer- 
tainly, there is a distinction now being pondered between society’s cultural demands 
that all be able to read, calculate, communicate, and acquire a background of struc- 
tured knowledge in order to participate effectively in society, and society’s 
cultural biases which have been illustrated here from grammar and spelling, but 
which go much deeper. 

Having made the above observations to put the matter of cultural demands
in perspective, it is necessary to return to the earlier observations attributed
to Lorge and Anastasi. The tests themselves as of any date must be judged in
terms of their validity for predicting the currently accepted goals under current
procedures of instruction.

The discussion that follows of Publishers’ Test Information is limited 
to a sample of tests that are representative of the sorts frequently used in 
ability grouping at various grade levels from preschool to college. Considerable 
detail is given about a few tests widely used in elementary and secondary 
schools in grouping and in evaluating achievement. In addition, the most popular 
measure for use at the preschool level, a major college test and two new tests 
specially designed to meet the problems of testing minority children are discussed 
briefly. Thereafter the discussion proceeds in a subsequent section to relevant 
research studies of less specific emphasis.

Publishers' Test Information



The search for information about tests most widely used in school testing 
situations was initiated with a letter to each of seven major publishers of 
standardized tests asking for any data or other information they might have 
available about their own tests that would be pertinent to their use in ability
grouping. Particular interest was expressed in predictive validity and/or 
reliability coefficients that the publishers themselves might have developed 
for groups differentiated by socioeconomic levels, or by race or ethnic background. 

While only four of the seven publishers could provide useful data about
tests on which they had done research, others reported research in progress, and
all indicated that they were sensitive to the need for testing instruments free
from cultural bias. Some reported the addition of members of minority groups
to their professional staffs and provision for review of their test items by
representative committees to detect instances of item bias.

Data supplied by test publishers are presented below. For some tests only
reliability data are available; for others there are data regarding both
reliability and predictive validity. With very few exceptions, these statistics
show the tests to be unbiased with respect to any minority group, ethnic or
socioeconomic; where such statistics favor one group over another, they
appear to favor the minority rather than the majority group.

For the Preschool Inventory, formerly called the Caldwell Preschool Inventory,
an instrument designed for use in the Head Start Program, Educational
Testing Service reports deciles, summary statistics, and statistical
characteristics for 317 children in eight kindergarten centers in North Carolina.
This sample was divided into three groups by a consideration of each child's
standing on two measures of socioeconomic status, the "Coleman" Index and an
adaptation of the Ypsilanti Home Environment Scale, itself an adaptation of
Wolfe's Environmental Process Scale. The two measures correlated .31 with
each other. Scores for children at three socioeconomic (SES) levels increased
from the low to the high group but the differences in mean score were not
significant. KR-20* reliability coefficients were .91, .89, .91, and .92 for
low, middle, and high SES groups and the total group, respectively; for the
total standardization sample the KR-20 reliability coefficient was .91.
Individual items which appeared to be unusually difficult or unusually easy for the
low SES group were, more often than not, the same items that were unusually
difficult or unusually easy for the total North Carolina group and for the
standardization sample.

*Kuder-Richardson reliability coefficients, Formula 20.

In the Directions Manual for the Clymer-Barrett Prereading Battery, published
by Personnel Press, Inc., split-half reliability coefficients are presented
for four groups of first-grade children selected because of their difference
from the norming population or because they might present special testing problems
resulting in unreliable work on the tests. These groups are described as follows:

Group A   Kindergarten pupils tested in May; 120 children in 3
          classes, one system. Mean total score 74.85.

Group B   First grades in three bilingual, rural schools in the
          Southwest; 63 pupils, mean total score 24.4.

Group C   First grade in a rural, white, low-ability school;
          52 pupils, mean total score 20.0.

Group D   First grade in a rural, Negro, low-ability school;
          28 pupils, mean total score 24.2.

Group E   Five first grades in two mixed-ethnic, deprived-
          neighborhood schools in a very large city; 111
          pupils, mean total score 25.6.

The reliability data for groups B, C, D, and E are presented below, together with
those for the norms group. The data for Group A are omitted because they are
for a group that is exceptional only in age (very young) rather than in cultural
background.



Table 1

Clymer-Barrett Prereading Battery
Reliability Coefficients for Special Groups and Norms Group

                              Special Groups            Norms
Test                       B      C      D      E       Group

Visual Discrimination     .96    .97    .94    .97      .94
Auditory Discrim.         .94    .98    .89    .94      .82
Visual-Motor              .91    .94           .95      .89
Total (Short Form)        .94    .97    .93    .96      .92
Total (Full Form)         .97    .98    .96    .98      .95


The data indicate that even though the Clymer-Barrett Prereading Battery may be
considerably more difficult for children in educationally atypical groups, it
performs as well with them as it does with early first graders in the usual
kinds of educational settings, so far as reliability is concerned.

By far the largest amount of data based on the use of tests with atypical
groups has been published by Harcourt, Brace and Jovanovich, Inc. This is
especially appropriate since their tests are used so widely in so many kinds
of testing situations, especially those involving grouping.

For the Metropolitan Readiness Tests, the Manual of Directions provides
split-half reliability data for seven different school systems at different
socioeconomic levels with mean total scores ranging from 51 to 66. Since the
subtests are so short that it is recommended that relatively little significance
be attached to the subtest scores of individual students, only the reliability
coefficients for total score are shown.

Table 2

Metropolitan Readiness Tests
Split-Half Reliability Data for Form A in Seven School Systems

School    Number of              Month of     Mean
System    Students     Grade     Testing      Score     r11*

  A         167          1       October      63.0      .91
  B         173          1       October      57.9      .91
  C         200          1       October      50.8      .94
  D          88         Kdg.     May          66.4      .95
  E          86         Kdg.     May          54.0      .93
  F          59         Kdg.     May          53.4      .91
  G          65         Kdg.     May          52.9      .90

*Indicates split-half reliability coefficient.





Table 3

Metropolitan Readiness Tests
Split-Half Reliability Data for End-of-Kindergarten
Administration of Form B in Systems D, E, F, G

School    Number of    Mean
System    Students     Score     r11

  D          82        66.5      .93
  E          91        53.2      .94
  F          55        55.8      .92
  G          61        51.0      .93



Alternate form, or test-retest, reliability data are also given for
end-of-kindergarten children in systems D, E, F, G. For both Form A first-Form B second
and Form B first-Form A second groups, total score reliabilities of .91 are reported.

With the observed reliability values for total score ranging from .90 to .95
and the measurement error of an individual score ranging from 3 to 5 points, as
reported by the publisher, it would appear that total scores on the Metropolitan
Readiness Tests may be used with considerable confidence for the purposes for which
the tests are recommended.

The manual also provides predictive validity data for a variety of student
groups and circumstances. The basic data include correlations between readiness
scores and scores on the Stanford Achievement Test: Primary I (1964 Revision)
the following May for 9497 students in the USOE First-Grade Reading Study of
1964-65 who participated in the standardization for the Readiness tests. Mitchell
(1967) later used the scores of the same students to investigate the predictive
validity of these tests and the Murphy-Durrell Reading Readiness Analysis by ethnic
and socioeconomic differentiation. Certain of the Mitchell data, available upon
request from the publisher, are summarized in Tables 4-6 on pages 10-12.

It is well to reiterate here the rationale of the statements above and below
regarding bias in the tests. A test is adjudged to be biased only insofar as it
provides information that leads to faulty inferences. If a test gives dependable
evidence of present status on a variable for members of a minority group, as
measured by a high reliability coefficient, and if it also predicts subsequent
achievement as well for minority groups as for the general population represented in the
norms as measured by equally high correlation with achievement scores, the test is
unbiased in its use for these purposes. The test may yield lower scores for minority
group students, reflecting a disadvantage for the group on that test that is matched
by the disadvantage these students experience in meeting the standard demands of
instruction. Thus, the bias is in past conditions or in the absence of effective
adaptation of instruction, rather than in the tests.

The results shown in Table 4 do not support the hypothesis that the Metropolitan
or the Murphy-Durrell tests have lower predictive validity for minority group students
than for white students. For the Metropolitan tests, of the 15 correlations shown,
12 favor minority groups; for the Murphy-Durrell tests, nine of the 15 correlations
favor the minority groups. Nor is there any consistent pattern of advantage or
disadvantage among the three minority groups.

In terms of socioeconomic differentiation, the predictive validities of the
Metropolitan Readiness Tests appear to be considerably higher for the scores of
children in less privileged communities than for those in more privileged
communities. In comparing the predictive validities in Tables 5 and 6, however, it is
important to consider the relative size of the standard deviations of the scores
on the Readiness tests. The differences indicate greater variability for the
readiness of children in the less privileged communities, and this would act to
inflate the validities for these groups. Had the standard deviations for the two
kinds of communities been more comparable, the differences in validities would
have been less pronounced.
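
The effect of unequal variability on validity coefficients can be illustrated with the standard formula relating a correlation to a change in the spread of predictor scores. The formula is a textbook result that assumes the same linear regression and the same scatter about the line in both groups; it is not drawn from the Metropolitan manual, and the values of r and of the two standard deviations below are hypothetical.

```python
from math import sqrt


def validity_at_new_spread(r, sd_ratio):
    """Correlation expected when the predictor SD is multiplied by sd_ratio.

    Assumes a common regression line and equal scatter about it in both groups.
    """
    return r * sd_ratio / sqrt(1.0 - r ** 2 + (r ** 2) * sd_ratio ** 2)


# Hypothetical case: validity of .50 where readiness scores have SD 10,
# applied to a group whose readiness scores have SD 15.
r_narrow = 0.50
r_wide = validity_at_new_spread(r_narrow, 15.0 / 10.0)
print(round(r_wide, 2))
```

Under these assumptions a coefficient of .50 in the less variable group corresponds to roughly .65 in the more variable group, with no change at all in how well the test works for an individual child.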

For the Otis-Lennon Mental Ability Test, also published by Harcourt, Brace
and Jovanovich, Inc., split-half reliability data are provided for five
socioeconomic levels of community. These are shown in Table 7 below.



Table 7

Otis-Lennon Mental Ability Test
Split-Half Reliability Coefficients for Socioeconomic Strata
of the National Standardization Sample

                                      Otis-Lennon Level and Grade                             Number of
Socioeconomic           Primary II    Elementary I  Elementary II Intermediate  Advanced      School Systems
Level*                  Grade 1       Grade 3       Grade 5       Grade 8       Grade 11      Within Stratum

High            Median  .87           .90           .94           .94           .94           9
                Range   .79-.90       .87-.95       .90-.95       .92-.95       .94-.96

Above Average   Median  .88           .94           .95           .94           .94           11
                Range   .85-.91       .90-.95       .94-.96       .92-.96       .93-.96

Average         Median  .90           .92           .94           .95           .95           17
                Range   .87-.93       .87-.93       .83-.96       .93-.96       .92-.97

Below Average   Median  .91           .92           .95           .95           .94           9
                Range   .88-.93       .89-.94       .94-.97       .92-.97       .93-.96

Low             Median  .90           .92           .95           .96           .95           8
                Range   .89-.93       .90-.94       .93-.97       .93-.96       .92-.96

Complete Standardization
Sample                  .90           .92           .95           .95           .95

*Public school systems with less than 300 total enrollment were not included.

In addition to the reliability data for different socioeconomic strata, the
Technical Handbook accompanying the Otis-Lennon tests reports standard errors of
measurement for successive score levels from IQ 50-70 to IQ 128-150. These
range from 3.2 to 7.9 for single grades at single IQ ranges, from 4.4 to 6.6
for IQ level average, and average 4.9 for the total group.

Validity data for the Otis-Lennon test are reported for a large number of 
schools with mean IQs as high as 110 and as low as 94. Correlations between 
Otis-Lennon scores and scores on several widely used achievement test batteries 
and ability tests and with end-of-year course grades are given. School districts 
tested are identified as to SES level. Correlations between Otis-Lennon scores 
and scores on the achievement tests range from .50 to .80; correlations between 
Otis-Lennon scores and teacher grades are somewhat lower; and correlations between 
Otis-Lennon scores and scores on other ability tests are somewhat higher.

To aid in the interpretation of scores on the tests included in the College
Entrance Examination Board Admissions Testing Program, the Board has published 
annually score report booklets for students, counselors, and admissions officers, 
and, periodically, much more comprehensive score reports. In addition, they 
have, through the years, commissioned a large number of research studies, and 
reports of many of these studies have found their way into professional journals. 
Two of these reports are particularly pertinent to the present discussion. 

Studies conducted by Roberts (1962), Hills, Klock, and Lewis (1963), Boney 
(1966), and Stanley and Porter (1967) gave evidence that the Scholastic Aptitude 
Test (SAT) of the College Entrance Examination Board was as valid for predicting 
grades of students in predominantly black colleges as for predicting the college 
grades of white students (Kendrick and Thomas, 1970). The possible bias of the
SAT in predicting college grades at integrated colleges was investigated by
Cleary (1968) at the suggestion of the College Board.

Cleary and Hilton (1968) had earlier investigated possible bias in the 
Preliminary Scholastic Aptitude Test (PSAT) by studying the test items to see 
whether any items produced an uncommon discrepancy in scores for different 
racial and socioeconomic groups. On the basis of four separate studies of 
analysis of variance attributable to (1) "race," (2) SES, and (3) items, in 
the responses of 1410 twelfth-grade students who had taken PSAT in seven 
integrated high schools in three large metropolitan areas in 1961 (N = 636) or 1963 
(N = 774), Cleary and Hilton concluded that while there were a few items producing 
an uncommon discrepancy between the performance of Negro and white students, 
the PSAT for practical purposes was not biased either for different ethnic groups 
or for groups at different socioeconomic levels. They based their conclusion on 
the absence of interaction* effects between item and "race" or item and SES. 
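
The idea of an item-by-group interaction can be made concrete with a small sketch. The code below is an illustration only, not the procedure Cleary and Hilton used: the function name and the proportions correct are hypothetical, and a full analysis of variance would also test these residuals against error variance. It computes, for each cell of an item-by-group table, how far the cell mean departs from what the item's overall difficulty and the group's overall level would jointly predict.

```python
def interaction_residuals(cell_means):
    """Return the interaction residual for every cell of a two-way table.

    residual = cell mean - row mean - column mean + grand mean.
    All residuals are zero when row and column effects are purely additive.
    """
    n_rows = len(cell_means)
    n_cols = len(cell_means[0])
    grand = sum(sum(row) for row in cell_means) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in cell_means]
    col_means = [
        sum(cell_means[i][j] for i in range(n_rows)) / n_rows
        for j in range(n_cols)
    ]
    return [
        [cell_means[i][j] - row_means[i] - col_means[j] + grand
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Hypothetical proportions correct: rows are items, columns are groups.
# In the first table the groups differ by the same amount on both items.
additive_table = [[0.80, 0.60],
                  [0.50, 0.30]]
# In the second table one group is at an extra disadvantage on item 2 only.
item_bias_table = [[0.80, 0.60],
                   [0.50, 0.10]]

additive_residuals = interaction_residuals(additive_table)
item_bias_residuals = interaction_residuals(item_bias_table)
```

In the first table every residual is zero: one group scores lower overall, but no particular item is responsible. In the second table the residuals are nonzero, which is the kind of pattern a significant item-by-group interaction would signal.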



*Interaction between two variables in an analysis of variance is a term to 
describe the tendency of individuals with particular combinations of status on 
the two variables to do much better or worse than would be indicated by their 
standing on the two variables separately. Here, if "race" or SES had given 
excessive disadvantage on particular items, the analysis of variance would 
have shown large interaction effects between item and "race" and/or item and 
SES. 

The possible bias of the SAT in predicting college grades of black students 
at integrated colleges was investigated by Cleary (1968). She used the test as 
a whole as a predictor of college grade averages for both black and white stu- 
dents, hypothesizing that the test could be considered to be biased if too high 
or too low a criterion score was consistently predicted for members of the 
subgroup. Cleary concluded that there were no significant differences in 
prediction for black and white students from the two Eastern colleges represented 
in the study. At a third college in the Southwest, significant differences were 
found in the regression lines for black and white students, but it was a matter 
of overprediction of college grades for black students by the use of the white 
or common regression lines. 

In a study parallel to Cleary's, involving 13 integrated colleges, Temp 
(in preparation) found that the use of a regression equation based on the 
majority or white student group resulted in the prediction of college grades 
for black students that were higher than those that they actually earned. 
According to Temp, colleges might consider the possibility of using separate 
regression lines for black students. 
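
A minimal sketch of what "overprediction" by a majority-group regression line means is given below. It is not Cleary's or Temp's analysis: the function names are hypothetical, and the test scores and grade averages are invented so that the arithmetic is easy to follow. A line is fitted to the majority group only, and the mean of (predicted minus actual) grade is then computed for each group; a positive mean for a subgroup indicates that the line predicts higher grades than that subgroup actually earned.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_prediction_error(slope, intercept, xs, ys):
    """Mean of (predicted - actual); a positive value means overprediction."""
    errors = [(intercept + slope * x) - y for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Invented data for illustration only (not from any study cited here).
majority_scores = [400, 450, 500, 550, 600]
majority_grades = [2.00, 2.25, 2.50, 2.75, 3.00]
minority_scores = [350, 400, 450, 500]
minority_grades = [1.55, 1.80, 2.05, 2.30]

# Fit the regression line on the majority group alone.
slope, intercept = fit_line(majority_scores, majority_grades)

majority_error = mean_prediction_error(
    slope, intercept, majority_scores, majority_grades)
minority_error = mean_prediction_error(
    slope, intercept, minority_scores, minority_grades)
```

With these invented numbers the majority-group line fits its own group with no average error but predicts grades about 0.2 grade points too high for the second group. A separate regression line for that group, as Temp suggests colleges might consider, would remove that average error.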

As this document is being written, a comprehensive technical report on 
research and development activities relating to the tests in the College 
Board Admissions Testing Program is in press (Angoff, ed.). In addition to 
an overview of administrative and technical problems of the program itself, 
the report describes construction practices involved in the Scholastic Aptitude 
Test and the achievement tests, discusses the statistical characteristics 
of the tests, the score scales, test validity, and the norms, and summarizes 
the results of several special studies having to do with the possible effect 
on test performance of coaching, test repetition, fatigue, anxiety, curriculum 
bias, and social and cultural factors. The Cleary and Hilton and the Cleary 
studies described above are among those reported. 

A two-part Report of the Commission on Tests (College Entrance Examination 
Board, 1970) offers a variety of position papers, informed by research studies, 
on future directions for the College Board's program offerings. The commission's 
21 members were drawn from persons variously concerned and qualified to 
deal with emerging issues in the use and interpretation of the tests in that 
program. The papers in this compilation, covering a broad range of purposes 
and services, bear in varying degree on the issues under discussion here. 

In particular, the opening article of Part II, Briefs, by John Carroll, endorsed 
by 19 of the 21 commission members, recommends revision of the widely used 
Scholastic Aptitude Test to accomplish better descriptive measurement of college 
applicants, especially the disadvantaged. Hope is expressed that psychometric 
techniques might be applied to the development of tests that will provide for 
separate report scores for (1) verbal knowledge (culturally influenced), 
(2) reasoning ability (largely verbal but less influenced by breadth and rich- 
ness of cultural experience), (3) listening comprehension (a capability sepa- 
rately important and presumably less influenced by culture than reading), and 
(4) a de-emphasized section on quantitative reasoning (still hopefully allowing 
the culturally disadvantaged to show their potential as the present mathematics 
section does, relatively independent of verbal facility). The reader is directed 
to the original documents for the details which may be of particular interest 
and applicability in his own situation. 

The American College Testing Program (ACTP), which seeks to serve the same 
function in college admissions, has its own intensive research studies in progress 
designed to identify item and/or test bias in its offerings. A major technical 
report, incorporating the findings of these studies, will likewise seek to map 
a course for the ACTP but is not scheduled for publication until late 1971 or 
early 1972. 

Two new tests designed especially for use with the disadvantaged have 
recently been reported in the literature: A Reading Prognosis Test, published 
by the Institute of Developmental Studies, and the Orr-Graham Listening Test, 
also known as BoLt for Boys' Listening Test, published by the American Institutes 
for Research. 

The Reading Prognosis Test is a 25-minute test, individually administered, 
measuring Language, Perceptual Discrimination, and Beginning Reading Skills. In 
a series of studies, the test was pretested and validated on balanced samples that 
included equal numbers of children from middle and lower socioeconomic groups and 
equal numbers of Negro and white children (Weiner and Feldmann, 1963). In an 
initial pilot study involving 40 children, the Reading Prognosis Test correlated 
.87 with the Gates Primary Reading Tests: Word Recognition of 1958. A second 
study involved 126 children, tested in October with the new test and in May with 
the Gates Primary Reading Tests: Sentence Reading and Paragraph Reading. In the 
October testing, retesting within three weeks of the initial testing yielded a 
reliability coefficient of .93 for the total group. At this time also the 
concurrent correlation with the Lorge-Thorndike Intelligence Tests for 138 children 
was .42 for the lower SES group and .21 for the middle SES group. The correlations 
of the Reading Prognosis total test score with the Paragraph Reading test ranged 
from .79 for the lower-class Negro female group to .89 for the middle-class white 
male group. The total group correlation was .81. The correlations of the Reading 
Prognosis total test score with the Sentence Reading test ranged from .61 for 
the middle-class Negro female group to .88 for the middle-class white female group. 
The authors concluded that the Reading Prognosis total test score, at the beginning 
of grade 1, is a good predictor of Gates scores for different SES groups at the 
end of a year's instruction. 

In a later validation study involving 300 Negro and white first graders in 
a large urban area and in a suburban community, correlations between the Reading 
Prognosis Test and the Gates Primary Reading Tests: Paragraph Reading and the 
Metropolitan Reading Test at the end of grade 1 ranged from .71 to .80, and 
correlations for separate ethnic and SES groups from .66 to .88 (Feldmann, 1965). 

Other and largely similar validation data are reported in the 1964-65 Research 
Memos of the Institute of Developmental Studies. Generally, the best prediction 
is shown to be for Negroes and for the lowest SES group. 

The Orr-Graham Listening Test was developed between 1964 and 1968, with the 
financial support of the College Entrance Examination Board, to identify educa- 
tional potential among disadvantaged eighth-grade Negro boys. The content of 
the test, an 86-item, 90-minute instrument administered orally, was designed to 
be of interest to boys of junior high school age. The stories in the test 
are based on such topics as spies, baseball players, cowboys and soldiers. The 
test was developed to elicit motivation through increased interest and to 
provide a test of aptitude which was not dependent upon reading proficiency. 

All research, from that preceding the actual development of the test, through 
preliminary tryouts to the final administration, was carried on in junior high 
schools in the District of Columbia. About 99 percent of the boys included in the 
samples were Negroes. On the basis of a "final administration" of the test, Orr 
and Graham (1968) reported the test to be reliable, acceptable to the group for 
which it was intended, and uniquely different from the traditional aptitude and 
achievement tests. They obtained a split-half reliability coefficient of .85 and 
a Kuder-Richardson (20) reliability coefficient of .89. Correlations of the total 
test score with total scores on the School and College Ability Test (SCAT), 
STEP Listening, and STEP Reading were .60, .49, and .69 respectively. The results 
showed that about 81 per cent of the boys liked the Listening Test and preferred 
it to a reading test covering the same content. 
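
The two reliability coefficients named above can be illustrated with a short sketch. The code is a generic textbook version under stated assumptions, not Orr and Graham's computation: the function names and the tiny response matrix are hypothetical, the split is odd versus even items, the half-test correlation is stepped up with the Spearman-Brown formula, and the Kuder-Richardson formula 20 is computed with the population variance of total scores (conventions on this point vary).

```python
from statistics import mean, pvariance


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def split_half_reliability(responses):
    """Odd-even split-half reliability with the Spearman-Brown correction.

    responses: one list of 0/1 item scores per examinee.
    """
    odd_half = [sum(person[0::2]) for person in responses]
    even_half = [sum(person[1::2]) for person in responses]
    r_half = pearson_r(odd_half, even_half)
    return 2 * r_half / (1 + r_half)


def kr20(responses):
    """Kuder-Richardson formula 20 for items scored 0 or 1."""
    n_people = len(responses)
    n_items = len(responses[0])
    total_variance = pvariance([sum(person) for person in responses])
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(person[j] for person in responses) / n_people
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


# Hypothetical responses of five examinees to four items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

split_half = split_half_reliability(responses)
kr20_value = kr20(responses)
```

For this toy matrix the corrected split-half coefficient works out to .88 and KR-20 to .80. The two need not agree, since the split-half value depends on which items fall in each half while KR-20 reflects the average over all possible splits.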

Carver (1969) reported on a replication of the Orr and Graham study with 
extension to other ethnic and income-level groups. In this study 615 eighth-grade 
boys in the District of Columbia area, 314 Negroes (182 low-income, 132 
middle-income) and 301 whites (110 low-income, 191 middle-income) were administered 
the Listening Test, SCAT (Level 2), and STEP Listening , and filled out questionnaires. 
Family incomes of $5000 divided the low- and middle-income groups. 

An incidental reliability study of 1C2 low-income Negroes yielded an alternate 
form reliability of .78. For the low-income Negro group, correlations between 
the Listening Test and other test variables were highly similar to those in the 
earlier study; for all groups combined, the Listening Test correlated .69 with 
SCAT total score and .78 with STEP Listening , considerably higher than the corre- 
lations in the earlier study. The correlations between the Listening Test and 
STEP Listening ranged between .65 for the low-income Negroes and .79 for the middle- 
income Negroes. The low-income Negroes scored lowest on all tests, the middle- 
income whites scored highest on all tests, and the difference between these two 
groups was always greater than one standard deviation. The questionnaire responses 
showed that all four groups preferred the Listening Test to SCAT, but only the 
two Negro groups preferred it to STEP Listening . 

Carver concludes that the reliability of the Orr-Graham Listening Test for 
low-income Negroes appears to be adequate and stable since there was little 
difference in the split-half correlations of the earlier study and the alternate 
forms correlations in his study. The concurrent validity is quite high, as 
indicated by the high correlation between the test and STEP Listening. The test 
also appears to be an adequate indicator of aptitude since the correlation with 
SCAT is high. He questions the high uniqueness of the test for identifying 
educational potential among the disadvantaged; to Carver the test is unique only 
in that it is preferred by Negroes. He finds no support for the hypothesis 
from the earlier test results that the effect of disadvantagement may be more 
associated with reading proficiency than with verbal proficiency in general. 

The large Negro-white differences are apparent in the Listening Test as well 
as in the reading and verbal measures. 

In two other articles (1968, 1968-69) Carver further discusses the 
questionable uniqueness of the test and the failure of the test to lessen 
score differences between Negroes and Whites. 

To summarize, systematic efforts are being made by test publishers and 
research agencies to review present test offerings and to introduce new emphases 
to meet the problem of assessing the capabilities of disadvantaged children. 

To date, the studies of old and new materials suggest possibilities but little 
accumulated capability for meeting the assessment problem directly. 

The negative evidence that tests standardized on other populations tend 
to overpredict the subsequent performance of disadvantaged individuals, hence 
are not unfair to them, is cold comfort. The challenge is to mount a campaign 
of innovative teaching and evaluative research that will enhance learning by 
describing learning progress directly, rather than to settle for procedures 
that are fair only in the sense that they reflect "fairly" the current 
unmitigated disadvantages. 

Now that the problem of assessing the potentiality and achievement of 
variously disadvantaged children is being faced, we must trust to continuing 
honest effort to separate the essential from the secondary objectives of public 
instruction to provide differential criteria of effectiveness of instructional 
adaptations. Thereby, it should be possible to help those operating from 
limited backgrounds to achieve increasingly greater mastery of essentials, 
including a self-respect that allows them to make a distinction between the 
essential and the ornamental outcomes of education. 



A second source of information, and a valuable one, was the Information 
Retrieval Center for the Disadvantaged at Teachers College, Columbia University. 
Useful studies found there were concerned with the testing of the culturally 
limited at all levels, from preschool to college students and adults; the 
testing of non-whites, including the Negro, the Mexican-American, and the 
American Indian; and the advantages and disadvantages of particular tests and 
particular types of tests for use with non-middle-class white groups. 

Public libraries and university libraries gave access to the many periodicals 
in which articles were located through the Education Index, and to Dissertation 
Abstracts and Psychological Abstracts. The libraries of two test publishers 
proved a good source for unpublished studies. A visit to the Institute for 
Developmental Studies resulted in the location of other pertinent data. ERIC 
abstracts for reports related to the disadvantaged and testing were examined. 

Research relating to the effects of cultural background on test scores and 
the kinds of educational opportunities that have been afforded or denied the 
disadvantaged as a result of test performance has increased in volume and inten- 
sity as concern for the improvement and extension of opportunities generally for 
minority groups has become universal. But research of this kind is not new; 
for more than 60 years researchers have been exploring and reporting the 
complexities and problems of the use of tests with culturally different groups, 
even though for much of that time what they had to report may have been listened 
to by relatively few. While the great bulk of this research has been reviewed 
in preparation for the writing of this document, no attempt has been made to 
summarize the research that has been summarized elsewhere, except for those 
studies that have particular pertinence here. Instead, emphasis has been put 
on those studies which have been done since 1960, most of them since 1965. 

Anyone interested in wider reading, particularly of the earlier studies, is 
referred to a half dozen of the most comprehensive surveys of the literature. 

Lucas (1953) reviewed 253 pieces of literature relating to the effects of 
cultural background on scores on aptitude tests. Campbell (1964) included 
46 references in his review of research done between 1932 and 1963 concerning 
the testing of culturally different groups. Pettigrew (1964) in the biblio- 
graphy in his book on the Negro American listed among his 565 references almost 
200 studies related to Negro-American intelligence. Shuey (1966) reviewed 
382 studies in the latest edition of her volume bearing on racial differences 
in intelligence; while her conclusions relative to differences between Negroes 
and Whites, as determined by intelligence tests, have been the subject of 
considerable criticism, few would contest the statement that her coverage of 
the literature of the last 50 years is extensive. Dreger and Miller (1968) 
reported a comprehensive survey of psychological studies of Negroes and Whites 
done in the United States between 1959 and 1965. Flaugher (1970), in a recently 
completed review of research on testing practices, minority groups, and higher 
education, lists 65 references covering the years 1913 to 1970. 

Studies of discrimination against minority groups in testing have usually 
dealt with the aspects of test content, the norms population, and the interpre- 
tation of results. What about the testing procedure itself? Do certain testing 
conditions systematically favor one cultural or racial group over another: 
examiner's race, test directions, pretest practice, speededness, test-wiseness? 
The next three studies were concerned with some of these conditions. 

Pelosi (1969) made a study of the effects of examiner race, sex, and 
style on the test responses of adult Negro examinees. In his experiment, 96 
Negro males were given six subtests of the Wechsler Adult Intelligence Scale 
(WAIS), the Purdue Pegboard, and the IPAT Culture Fair Intelligence Test, eight 
tests involving 12 scores, by examiners who included Negroes and whites, males 
and females, "warm" and "cold" personalities, with three examiners within each 
race-sex category. A separate analysis of variance was done for each of the 
12 scores. 

None of the examiner attributes or the interactions between them were signi- 
ficant on seven of the eight tests. The exception was the Culture Fair Test, 
group administered, for which "cold treatment by male Negro examiners resulted 
in substantially higher scores than those obtained by female Negro examiners." 

On all but one subtest of WAIS, the mean scores were higher with white examiners 
and for examinees treated coldly. 

Pelosi writes: "Though differences were small and non-significant, the 
general direction contradicts the findings of previous research which suggested 
inadvertent negative bias due to white examiners." He suggests two weaknesses 
in the study, however: (1) The subjects were volunteers, enrollees in an anti- 
poverty work experience project, and were not as "ego-involved" as would be the 
case in an actual testing situation. (2) The "warm" and "cold" examiners were 
not sufficiently different in the testing situations. 

Abramson (1969) examined the effect of the race of both children and 
examiners on the child's performance on the Peabody Picture Vocabulary Test, 
an individually administered test. Two white and two Negro examiners admin- 
istered the test to 88 and 113 white and Negro children in first grade and 
kindergarten, respectively, in an integrated urban school. The first graders 
had been in the school since their kindergarten year and the kindergartners 
had been in school for five months. The children had usually seen the examiner, 
a paraprofessional working in the school, at least once a day during the time 
they had been in school. 

The investigator found a small but statistically significant interaction 
of the examiner's race and the child's race for first graders but not for 
kindergartners. He suggested that this difference might have been the result 
of the first graders having reached an age of racial awareness, but there were 
no data available regarding racial awareness. 

A study reported by Dubin and Osburn (1969) was directed toward investi- 
gating whether two other conditions, aspects of the test procedure itself 
(extra preliminary practice and extra testing time), systematically favored 
white examinees over Negro examinees. Their sample included 235 Negro and 
232 white students, representing both high and low socioeconomic levels, from 
two high schools in Galena Park, Texas. All students in the sample were 
quite familiar with standardized tests. 

The Employee Aptitude Survey (four subtests) was used. Groups within each 
race in grades 9 and 10 were given the test with regular time limits; in 
grades 11 and 12 extra time was allowed. Some groups took only one form of 
the test; other groups took both forms, with the first testing considered as 
practice. An analysis of variance was done. 

The order of mean scores was as follows: 

By SES and Race          By Testing Conditions 

High SES Whites          Power test with practice 
Low SES Whites           Power test without practice 
High SES Negroes         Speeded test with practice 
Low SES Negroes          Speeded test without practice 

Interesting findings of the analysis of variance were these: 

1. Extra practice was no more advantageous to Negro than to 
white groups. 

2. Both SES groups profited from extra practice to a comparable 
degree. 

3. When Negro and white groups, matched by sex, grade level, and 
SES, were compared, improvement in score from speeded to power 
tests was no larger for Negroes than for Whites. 

4. High and low SES groups profited equally by the tripled time 
limits. 

5. When both extra practice and extra testing time were given, 
again the improvement was not significantly related to either 
race or socioeconomic status. 

The authors concluded that the results implied in a general sense that 
"testing procedure itself is not a major factor in discriminating between 
culturally advantaged and culturally disadvantaged students." 

Goldstein et al. (1970) studied the effect of a specially designed enriched 
curriculum for 161 children on (1) average test performance over the two-year 
range from beginning pre-kindergarten to end of kindergarten, and on (2) stability 
coefficients over the same range for Stanford-Binet IQ, Peabody Picture Vocabu- 
lary Test, and the Columbia Mental Maturity Scale. Treating these three measures 
as measures of various aspects of cognitive development, they concluded that 
although mean gains on all three measures were reliable, the PPVT was not sensi- 
tive to effects of special instruction of these young disadvantaged children. 

Lesser, Fifer, and Clark (1965) studied the influences of different social 
classes and cultures on patterns among mental abilities: verbal, number, rea- 
soning, spatial. They tested 320 first-grade children, including middle- and 
lower-class Chinese, Jews, Negroes, and Puerto Ricans, in New York City and 
New Rochelle, New York, with the Hunter Aptitude Scales, designed for gifted 
four- and five-year-olds. Social class was based on the Hollingshead and 
Redlich index, using occupation, residence, and education of the head of the 
family as criteria. The scales were administered individually by well-trained 
psychometricians of the same ethnic group as the child. 

Split-half reliabilities for the different ethnic groups (N = 80 for each 
group) ranged from a low .80 for Jewish children on Space to a high .96 for 
both Negroes and Puerto Ricans on Numbers. Split-half reliabilities by social 
class (N = 160 for each class) ranged from a low .80 for the middle class on 
Space to a high (value illegible) for the lower class on Numbers. The middle-class children 
were slightly higher on Verbal but lower on Reasoning, Number, and Space. No 
tests for significance across ethnic or social-class differences were reported. 

Means by ethnic group and social class are given below. 

Table 8 

Hunter Aptitude Scales 

                   Means for Ethnic Groups                  Means for Social Classes 

             Chinese   Jews     Negroes   Puerto Ricans     Middle Class   Lower Class 

Verbal        71.1     90.[?]    74.3        61.9              76.8           65.3 
Reasoning     25.9     25.2      20.4        18.9              27.7           24.2 
Number        27.8     28.5      18.4        19.1              29.8           25.6 
Space         42.5     42.5      34.4        35.1              44.9           40.1 


The greatest differences in standard deviation were in Verbal. 

An analysis of variance was done, and interactions of social class, ethnic 
group, and sex reported. The major findings were that (1) differences in social 
class do produce significant differences in absolute level of each ability, but 
do not produce differences in the pattern of abilities; (2) differences in ethnic- 
group membership produce differences in both absolute level and pattern of abilities; 
(3) social class and ethnicity interact to affect the level of each ability, but 
do not interact to affect patterns. The authors concluded by proposing that "the 
identification of relative intellectual strengths and weaknesses of members of 
different cultural groups become a basic and vital prerequisite to making 
enlightened decisions about education in urban areas." 

Brazziel and Terrell (1962) conducted an experiment in the development of 
readiness in a culturally disadvantaged group of first-grade Negro children, most 
of them from sharecropper homes. Twenty-six of the children were assigned to an 
experimental group and the other 66 to three control groups. Parents of the 
children in the experimental group were involved in registration and in the devel- 
opment of readiness activities. The experimental group was given a six-week readi- 
ness program, which involved travelogues, 30 minutes of educational TV each day, 
and intensified activity to develop perception, vocabulary, and the will to follow 
directions. Weekly tests were given on some form of readiness. 

At the end of six weeks, the Metropolitan Readiness Test was given to both 
experimental and control groups. The test results of the experimental group were 
greatly superior to those of the control group, the percentile rank for total score 
for the experimental group being 50 as opposed to 16, 14, and 13 for control groups 
A, B, and C respectively. The mean IQ of the experimental group in the spring of 
Grade 1 was 106.5, while second-grade Negro children in the country averaged 91.4 
in the state testing program. Brazziel and Terrell attributed the success of the 
program to "an efficacious combination of direct teacher-parent partnership, excel- 
lent materials, test wisdom development, and energetic, uninhibited teaching. . . ." 

Dowd (1 59) studied sex and race differences in the effectiveness of various 
composite predictors of initial reading success and the relationship of children's 
self-perception to initial reading success. He tested 366 children from a large 
suburban district at the end of Kindergarten with the Metropolitan Readiness Tests 
(MRT), both the 1949 edition and the 1965 Revision, the Clark and Ozehosky U-Scale 
measuring self-concept, and the Van Alstyne Picture Vocabulary Test. At the end 
of Grade 1, he gave the Gates Primary Reading Tests: Word Recognition to 232 of 
the original 366 children still in school. For all groups (Negro, white, boys, 
girls) the best predictor was the MRT, except for the 1965 Revision for Negro 
boys; for them a combination of the Numbers and Copying subtests in the 1949 
edition of the MRT provided the best prediction for the Gates tests. The U-Scale 
added significantly to the prediction in some instances; the Van Alstyne test did not. 

Beidler (1969) worked with 276 students in Kindergarten through Grade 2 in 
two schools in a disadvantaged neighborhood in Bethlehem, Pennsylvania, to deter- 
mine the effects of the use of the Peabody Language Development Kits (PLDK) on the 
intelligence, reading, listening, and writing of disadvantaged children in the 
primary grades. The experimental groups had seven months of use of the kits in 
addition to the normal language arts program followed by the control groups. 

The Lee-Clark Reading Readiness Test was administered to the Kindergarten 
in the spring, and the Otis-Lennon Mental Ability Test and the Cooperative Primary 
Tests in Reading and Listening to grades 1 and 2. A writing sample, scored for 
quantity and maturity, was obtained from grades 1 and 2. 

At the Kindergarten level, there was a highly significant difference in 
favor of the control group, leading one to suspect that the experimental and 
control groups at that level may not have been initially comparable. For grades 
1 and 2, no significant differences were found on intelligence, reading, or 
listening scores; in grade 2, however, the experimental group "wrote a signifi- 
cantly greater number of running words than did the control group." 

Beidler described the implications thus: ". . . compared to conventional pro- 
cedures, seven months of PLDK lessons do not significantly improve the intelligence, 
reading, listening, or writing of disadvantaged children in the primary grades." 

In 1962 a study of socioeconomic status and school achievement was made by 
the California Elementary School Administrators Association. The School and 
College Ability Test (SCAT) and the Sequential Tests of Educational Progress 
(STEP) were given concurrently to 3008 sixth-grade students in 40 schools in 
three school districts. Grouping in terms of socioeconomic level (SES) was 
accomplished by use of the Hollingshead Two-Factor Index, based on parent 
occupation and education level. The two top groups, of five, were combined to 
make four SES levels. 

Of pertinence here are the correlations between SCAT and STEP by SES levels. 
Was the prediction equally good at all levels? 

The correlations between SCAT-Verbal, SCAT-Quantitative, and SCAT-Total 
and six STEP subtests by SES levels all followed the same general pattern. 

For all 18 sets of correlations, the lowest r's were for the highest SES level. 
For 11 sets of correlations, the highest r's were for the next to the lowest 
SES level. For none of the 18 sets of correlations were the r's for the lowest 
SES level as low as those for the highest SES level. In other words, the 
prediction was generally better for the lower SES levels than for the higher 
SES levels. 
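
One common way to judge whether two such correlations differ by more than sampling error is Fisher's z transformation. The sketch below is an added illustration and not part of the California study: the function name is hypothetical, the test assumes the two SES groups are independent samples, and the inputs are the SCAT-Total with STEP Math correlations reported in Table 9 for SES level A (r = .71, N = 524) and SES level C (r = .81, N = 524).

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations.

    Each r is transformed with atanh (Fisher's z); the standard error of the
    difference is sqrt(1/(n1 - 3) + 1/(n2 - 3)).
    """
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error


# SCAT-Total vs. STEP Math from Table 9: SES level C compared with level A.
z_math_c_vs_a = fisher_z_difference(0.81, 524, 0.71, 524)
```

A z statistic beyond about 1.96 in absolute value is conventionally taken as significant at the .05 level; the value obtained here is well beyond that, which is consistent with the statement that prediction was better at the lower SES levels than at the highest.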

The correlations between SCAT-Total and STEP by SES levels, from high to 
low, are given below. 

Table 9 

California Correlations between SCAT-Total and STEP 
by SES Level 

                     Correlation of SCAT-Total with STEP 
                                                                      S.D.* 
SES     N     Math   Science   Soc. Stud.   Rdg.   List.   Writing    SCAT 

A      524    .71     .62        .67        .64    .57      .61       10.7 
B      566    .78     .72        .75        .72    .66      .70       11.3 
C      524    .81     .78        [?]        .76    .67      .74        9.0 
D      553    .76     .74        .79        .77    .66      .69        7.6 

* Standard deviation 



Roberts et al. (1963) reported a longitudinal study of the performance of 
69 Negro-American children on the Stanford-Binet Intelligence Scale, with 
special concern for the "causes or associated factors" of the observed dif- 
ferences. In this study different forms of the test were administered to the 
children at age 5 and age 10, with the second examiner having no knowledge 
of the earlier results. Data were gathered on parent occupation, family 
pattern, and socioeconomic level. 

Over the five-year period, male mean IQ's fell from 96 to 88 and female 
mean IQ's from 94 to 84, with the decreases being statistically significant 
in both cases. The respective standard deviations were 17.5 and 21.4 for 
the males, a large increase, and 13.2 and 15.4 for the females. The decline 
in IQ for boys seemed to be related to low socioeconomic status and unstable 
and unfavorable family patterns; the decline in IQ for girls was slightly in 
reverse. The number of cases, however, was so small for the subgroups that 
little confidence can be placed in the statistics reported. The largest 
decreases were with children showing the greatest difficulty with verbal 
skills. Verbal Absurdities was an "outstanding failure." There was slightly 
less difficulty with Repeating Digits, and Making Change was relatively easy. 

None of the children tested at age 10 could pass the 10-year vocabulary test. 

To obtain normative data on intelligence and achievement for a large
homogeneous sample for which there were no previous data, Kennedy et al.
(1963) administered the Stanford-Binet Intelligence Scale and the California
Achievement Tests (CAT) to a well-selected sample of 1800 Negro students in
grades 1 through 6 in five Southeastern states. They reported results by
metropolitan, urban, and rural counties, age, sex, grade level, and
socioeconomic status.

For the entire sample the mean IQ was 80.7, with a standard deviation of
12.4. The mean IQ decreased with age, with type of community (from
metropolitan to rural), and with socioeconomic level (from high to low); it
remained relatively stable by grade. The order of the items by difficulty was
quite similar to that of the norming population. The Negro students were
relatively high on Rote Memory, Digits, Making Change, and Days of the Week,
and low on Abstract Verbal, Vocabulary, Absurdities, and Comprehension.

On the CAT the mean grade equivalent on the total battery fell increasingly
below the norm (from .2 in Grade 1 to 1.2 in Grade 5) and decreased with
socioeconomic level; there was, however, no difference in achievement by type
of community. The correlation of the total battery with the Stanford-Binet
mental age was .69, about the level usually found for total school groups.

Hughes and Lessler (1965) compared the Wechsler Intelligence Scale for
Children (WISC) and Peabody Picture Vocabulary Test (PPVT) scores of 137 Negro
and white rural school children of the lowest socioeconomic level in North
Carolina. Ranging in age from 6 to 16, these children had been sent for testing
because of suspected mental retardation. The question was whether the shorter
PPVT could be substituted for the WISC, which was usually given.

Correlations between the two tests ranged from a low of .21 for White Males
for PPVT with WISC Performance to a high of .[illegible]6 for Negro Males for
PPVT with the Full WISC. Seven of the 12 correlations were .55 or higher. All
but one of the r's were significant at the one per cent level, and that one was
significant at the five per cent level. Generally, the r's for Negro children
were higher than for white children.

With the standard error of estimate* running from 7 to 14 points, the authors
conclude that "the PPVT has a distinct advantage over group tests of intelligence
for these rural children . . . and would perform an adequate screening function
when used in the school or by personnel from the mental health clinic."
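
The two statistics distinguished in the footnote to this passage can be
sketched numerically. The sketch below uses the common textbook forms (criterion
S.D. times the square root of 1 minus r squared for the standard error of
estimate, and test S.D. times the square root of 1 minus reliability for the
standard error of measurement). These formulas, the IQ standard deviation of 15,
and the reliability of .90 are assumptions made for illustration; they are not
taken from Hughes and Lessler, who may have computed their values differently.

```python
import math


def standard_error_of_estimate(sd_criterion: float, r: float) -> float:
    """Spread of criterion scores around the value predicted from another test."""
    return sd_criterion * math.sqrt(1.0 - r ** 2)


def standard_error_of_measurement(sd_test: float, reliability: float) -> float:
    """Spread of obtained scores around the hypothetical true score."""
    return sd_test * math.sqrt(1.0 - reliability)


# Hypothetical inputs: an IQ scale with S.D. 15, a predictor-criterion r of .55
# (the level reached by seven of the twelve correlations), and an assumed
# reliability of .90.
see = standard_error_of_estimate(15.0, 0.55)
sem = standard_error_of_measurement(15.0, 0.90)
print(f"standard error of estimate:    {see:.1f} IQ points")
print(f"standard error of measurement: {sem:.1f} IQ points")
```

Under these assumptions an r of .55 leaves a standard error of estimate of
roughly 12 to 13 IQ points, which falls inside the 7 to 14 point range reported.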



*The standard error of estimate is simply the standard deviation of the
differences between scores of the same individuals on the criterion test and the
predictor test, in this case expressed as IQ's. It is to be distinguished from
the standard error of measurement, which accepts the test being studied as its
own proper criterion and seeks to estimate departure of the value on this
test from the hypothetical true value that the test measures imperfectly because
[illegible] infinitely long. See definition of the standard error of
[remainder illegible]

[...] assign the children, particularly disadvantaged rural children, to EMR
classes on the basis of a vocabulary test.

An investigation by Knief and Stroud (1959) was planned, first, to provide
data on the social class or culture bias in intellectual testing and, second,
to ascertain interrelationships among certain relatively new intelligence tests
and tests of scholastic achievement. The instruments used were the
Lorge-Thorndike Intelligence Tests (L-T), Verbal and Nonverbal, the Davis-Eells
Games, Raven's Progressive Matrices (RPM), and the Warner Index of Status
Characteristics. All tests except the RPM were administered to a sample of 344
fourth-grade students in a Midwestern city, all the students present at the
time in six of 18 elementary schools. One hundred sixty-four of these students
who were in the fifth grade the following year were given the RPM.

All of the intelligence tests and composite scores on the Iowa Tests of
Basic Skills (ITBS) correlated significantly with social status and, with the
exception of the RPM, to approximately the same extent. The L-T Verbal scores
gave the best prediction of ITBS scores, followed in order by L-T Nonverbal
and the Davis-Eells Games. The L-T Verbal scores alone correlated with ITBS
about as well as did the entire battery of tests when combined in a
multiple-correlation design. The RPM correlated to a smaller degree with ITBS
than did any other intelligence test. The analysis gave little justification
for the use of L-T Nonverbal, the Davis-Eells Games, and RPM in conjunction with
L-T Verbal for general prediction purposes. This is not to deny, however, their
usefulness in individual diagnosis.

Davis (1969) followed 103 randomly selected students from Grade 3 through
Grades 5 and 6 to "measure improvement in test performance in disadvantaged
inner-city poverty tracts" in Knoxville during a federally sponsored
Communication Skills Project. The Metropolitan Achievement Tests of Reading,
Word Discrimination, Language Usage, and Spelling were administered in Grade 3
in 1967. Improvement was measured by relating to the 1965 results, 1966 and 1967
scores from California Achievement Tests in Reading Vocabulary, Reading
Comprehension, Mechanics of English, and Spelling. Davis reports that "over the
three test periods 48 comparisons for significance of differences . . . were
run. Computed results indicated significant differences in thirty-two of
the forty-eight comparisons."

Davis states in his thesis that "A basis for comparability of the MAT and
CAT subtests was accepted when given correlation coefficients between areas of
the two tests ranged from .77 to .95." It should be pointed out that correlation
indicates only similarity in rank; it tells nothing of the grade equivalent
scores, which could differ by months for students taking the two tests. There
are also questions as to how standard scores and raw scores could be compared
across the two tests (and levels), as the Grade 3 results on the MAT were
compared with Grade 4 and Grade 5 results of CAT. Was "improvement" the gain
from Grade 3 to later grades in the achievement areas considered? This
comparison of results across different tests is very common even though not
proper. There is evidence that MAT and CAT, particularly, are not comparable as
to grade equivalent scores. CAT gives higher results, and its grade equivalent
scores have a much smaller standard deviation.

The report appears to be an attempted evaluation of the effect of a federal
project. How could this be measured by using gain over two years? There
appears to be no relation of the gains to those of a group not in the study.
What gains over the same period of time for the same schools had been made in
previous years? What national norms give 1.0 as a normal yearly gain?
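
The objection that correlation indicates only similarity in rank can be made
concrete with a small sketch. The scores below are invented for illustration
and are not data from the Davis study: two hypothetical tests rank five
students identically, so they correlate perfectly, yet one reads half a grade
equivalent higher for every student.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical grade-equivalent scores for five students on test A.
test_a = [2.1, 2.6, 3.0, 3.4, 3.9]
# Hypothetical test B: same rank order, but every score is 0.5 higher.
test_b = [score + 0.5 for score in test_a]

r = pearson_r(test_a, test_b)
mean_gap = sum(b - a for a, b in zip(test_a, test_b)) / len(test_a)
print(f"r = {r:.2f}, mean difference = {mean_gap:.1f} grade-equivalent years")
```

A correlation in the .77 to .95 range therefore does not by itself justify
treating a score on one test as a baseline for gains measured on the other.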

A study by Eagle and Harris (1969) examined the relationship between race
and performance on two standardized reading tests, the reading tests of the Iowa
Tests of Basic Skills and the Metropolitan Achievement Tests. The tests
were administered to 850 fourth-grade students and [illegible]50 sixth-grade
students in all elementary schools of an urban district near New York City.
Although white students earned higher scores than nonwhite students on both
tests, the Metropolitan produced significantly greater differences between the
races, at both grade levels, than did the Iowa. At Grade 4, the Metropolitan
gave white students a superiority over nonwhite students of .72 years compared
to .58 for the Iowa. At Grade 6, however, the Metropolitan gave white students a
superiority over nonwhite students of 1.13 years compared to .73 for the Iowa,
a difference of about five months. Analysis of variance confirmed the
statistical significance of these differences at both grade levels.
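
The arithmetic behind the "about five months" figure can be laid out as a short
sketch. The four gap values are those reported above; treating the Grade 4
figures as years and converting years to months at 12 per year are assumptions
made here (a 10-month school-year convention would give about four months at
Grade 6).

```python
# Reported white-minus-nonwhite differences, in grade-equivalent years.
gaps = {
    4: {"Metropolitan": 0.72, "Iowa": 0.58},
    6: {"Metropolitan": 1.13, "Iowa": 0.73},
}

# Assumption: one grade-equivalent year is converted at 12 months.
MONTHS_PER_YEAR = 12

extra_gap = {}
for grade, by_test in gaps.items():
    # How much larger the racial gap is on the Metropolitan than on the Iowa.
    extra_gap[grade] = by_test["Metropolitan"] - by_test["Iowa"]
    months = extra_gap[grade] * MONTHS_PER_YEAR
    print(f"Grade {grade}: Metropolitan gap exceeds Iowa gap by "
          f"{extra_gap[grade]:.2f} years (about {months:.1f} months)")
```

The size of the gap thus depends on which test is used, and that dependence
grows from Grade 4 to Grade 6, which is the pattern an interaction between test
and group would produce.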

In brief, the Eagle-Harris findings imply that white elementary school
children are "favored" by the Metropolitan whereas Negro children are "favored"
by the Iowa when results are contrasted. Why is this so? Must one question
the validity of one or the other of these highly respected tests? The authors
suggest that in previous investigations involving comparisons among standardized
achievement tests, little consideration has been given to the question of
interaction effects between tests and sociocultural variables. Yet, failure
to take into account significant interactions can mask important changes taking
place in subgroup student performance and could provide the basis for erroneous
or misleading evaluation of curriculum effectiveness.

The implications of findings like those of Eagle and Harris could be profound. 
With the knowledge that one test would be more reflective of gains for a particular 
subgroup than another, what administrator would not choose to use the test that 
demonstrates the kind of performance, maximal or minimal, that will best suit his 
practical purposes? 

Santos (1967) studied the level and variability of achievement in educationally
disadvantaged attendance centers in Iowa, and investigated item characteristics of
the Iowa Tests of Basic Skills (ITBS) between educationally disadvantaged and total
representative groups. In the Iowa 1966 testing program with ITBS, the educationally
disadvantaged schools in all grades and all test areas were almost a year below the
norm for representative schools. Difference in item difficulty between
representative and disadvantaged schools was pronounced, and quite variable. The
discrimination indices were equally satisfactory in the two groups. Santos suggests
that research with experimental programs implies a need for reducing cultural bias,
adapting content to needs and interests, and adjusting the difficulty of the test
materials. "At the present time statements of behavioral objectives . . . are not
specific enough to be of much help to authors of achievement tests in determining
content, emphasis, and grade placement."

Buchanan (1969) studied the effect of cultural deprivation on the approach
to test-taking as indicated by response style to multiple-choice questions.
Buchanan asked whether the deprived student's social background, deficient
education, and experience of failure would lead him to reject the
problem-solving approach when he is faced with questions to which he does not
know the answers; that is, does he guess indiscriminately rather than attempt to
eliminate the less plausible distractors in multiple-choice questions to arrive
at an "educated" guess, as non-deprived students do?

Buchanan used three different tests at one grade level and one test at
three different grade levels and analyzed (1) items on which non-deprived and
deprived students experienced equal difficulty and (2) items with matched
difficulty indices. For matched questions there was no difference between
subcultural groups in the degree of selective guessing. Buchanan concluded that
indiscriminate guessing is related to a real informational deficiency rather
than to differences in motivation.

In a case study of the effects of educational deprivation on Southern
rural Negro children, Green and Hoffman (1965) studied the public schools of
Prince Edward County, which were closed from 1959 to 1963. During these four
years, most Negroes had no schooling (No Educ group); some had an average of
one and one half years (Educ group).

After resumption of school operation, the Stanford-Binet Intelligence
Scale and the Stanford Achievement Test-Partial Battery were given to 154
No Educ and 125 Educ. Extensive tables given by chronological age in the
Green and Hoffman report show that the extended educational deprivation had
a depressing effect upon achievement and intelligence at all ages. Language
deficits on the Stanford-Partial were greater than in other areas. On the
Stanford-Binet at the earlier ages (some children had never been to school),
the differences between IQ's of children with No Educ and those with some
Educ were as great as 30 points. In both the No Educ and the Educ groups,
there was a negative relation between age and measured IQ.

Lo Monaco (1969) studied four groups of disadvantaged ninth-grade Negro 
boys to determine their response levels to both standard and oral-visual 
administrations of two vocationally relevant instruments. The boys were assigned 
to two experimental and two control groups equated for age, reading comprehension, 
and socioeconomic level. 



Hypothesizing that reading deficits contaminate scores on standard versions
of the instruments and that disadvantaged youth have better listening
comprehension abilities than reading ability, Lo Monaco administered three
measures (the Metropolitan Reading Test, or MRT; the Kuder Preference
Record-Vocational; and the Life-Planning Questionnaire-Modified, or LPQ-M) to
all groups in the standard version and in a modified oral-visual version
involving no reading. The two experimental groups took both the standard version
and the oral-visual version in different sequence; one control group took the
standard version twice, and the other the oral-aural version twice.



Except for the Reading Test, oral-visual version scores were higher than
the standard version scores on all measures; on the MRT, this was true for
the low reading cases only. The oral-aural version provided more reliable
scores of interests on the Kuder and of strivings on the LPQ-M than did the
standard version.

According to Lo Monaco, "the findings of this study indicate that reading
deficits are important response variables. . . . Instruments can be modified
to mediate these difficulties."

Alzobaie, Metfessel, and Michael (1968) administered the Lorge-Thorndike
Intelligence Tests, Verbal and Non-Verbal, three of Guilford's tests of creativity,
the Test of Academic Performance-Reading, and two scales from the Cattell Culture
Fair Intelligence Test to 122 disadvantaged tenth-grade Negro students in a
district adjacent to Watts in Los Angeles. Grade point averages (GPA) and SES
indices from the Warner Index of Social Class scale were also obtained for each
student.

Intercorrelations among the predictors ranged from .23 to .82; the Guilford
total score had correlations ranging from .40 to .56 with the other predictors.
The Lorge-Thorndike and Reading tests showed small but significant correlations
with SES; the Guilford and Cattell tests did not. Correlations with a convergent
criterion measure* of academic success, GPA, ranged from .29 and .32 for the
Cattell scales to .56 for the Reading test; correlations with GPA for the three
Guilford tests, essentially divergent tests, were .46, .39, and .31, with .48
for the composite.

The authors conclude:

Despite their brevity, the three essentially non-verbal tests
of divergent production as well as their composite score
showed promise in the prediction of GPA. Thus, the three
Guilford tests afford an alternative means for predicting
traditionally evaluated academic performance of culturally
disadvantaged children, many of whom have substantial
disabilities in both receptive and expressive language
function relative to expectations of a middle-class Anglo-
American culture.

Harris and Lovinger (1968) investigated the commonly reported tendency of
Negro IQ's to drop with increasing age in a longitudinal study involving 35
boys and 45 girls in a very disadvantaged area in the borough of Queens, New
York City, in a school which had the lowest achievement and highest transiency
rate of any junior high school in the borough. All 80 students had been given
the same tests from the first grade on: Grade 1, Pintner-Cunningham Primary
Test; Grade 3, Otis Quick-Scoring Mental Ability Test: Alpha Level; Grade 6,
Otis Quick-Scoring Mental Ability Test: Beta Level; Grade 7, the Wechsler
Intelligence Scale for Children (WISC); Grade 8, the Cattell Culture Fair
Intelligence Test and the Pintner General Ability Test; Grade 9, WISC. There
were 12 measures in all.

No decrease in IQ was found throughout successive grades for this group
of disadvantaged Negro adolescents. Mean IQ at Grade 1 was 98, then 94, 88,
93, 96, 92, to 96 at Grade 9. On the WISC this group was not any more
handicapped on verbal than on non-verbal tests. At Grade 7 the mean was 93.8 for
Verbal and 93.7 for Performance; at Grade 9 the means were 96.1 and 97.0,
respectively. The correlations between the tests given two years apart were
.87 for Verbal, .85 for Performance, and .89 for Full Scale.

*The authors write: "Time limits of convergent tests favor the time-conscious

The purpose of a study by Bradley (1967) was to investigate selected
characteristics, academic performance, personal problems, and successes of Negro
undergraduates in seven formerly all-white Tennessee colleges and universities.
In addition to course grades, personal and social data were collected on 583
students over a two-year period by means of interviews and a student
questionnaire.

One result is pertinent for reporting here. The multiple regression equation
for best prediction of grade point average (GPA) includes these variables in this
order: (1) high school GPA, (2) a confidence in ability factor, (3) the American
College Testing Program (ACTP) social studies score, and (4) a morale factor. The
multiple R predicting college grades was .6131, with a standard error of estimate
of .5451 (one half the difference between two letter grades, as C and B).
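
The two figures just reported can be related through the usual large-sample
formula for the standard error of estimate. The worked equation below assumes
that formula applies without a degrees-of-freedom correction; the implied
standard deviation of college GPA is an inference drawn here, not a value
reported by Bradley.

```latex
s_{\text{est}} = s_y \sqrt{1 - R^2}
\quad\Longrightarrow\quad
s_y = \frac{s_{\text{est}}}{\sqrt{1 - R^2}}
    = \frac{0.5451}{\sqrt{1 - (0.6131)^2}}
    = \frac{0.5451}{\sqrt{0.6241}}
    \approx 0.69
```

On that assumption, a multiple R of about .61 reduces the uncertainty in
predicted GPA only from roughly .69 to .55 of a letter grade.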

Interestingly, Bradley found that no ACT score other than that for social
studies added any significant increase. In Bradley's words: "The ACT scores
in English and math cannot be used as a basis for predicting the academic success
of the Negro students in the same way that they are used to predict college
success for privileged white students."

Boney (1966) studied 104 Negro boys and 118 Negro girls in Grade 12 in a
Port Arthur, Texas, high school. The Cooperative School and College Ability
Test (SCAT) had been given in Grade 8. Three subtests from the Differential
Aptitude Tests (DAT) were administered at the end of Grade 12, concurrent with
the computation of the grade point average (GPA). A multiple correlation of .80
for boys and .82 for girls resulted when the predictors of junior high school
grade point average, the Sequential Tests of Educational Progress (STEP) in
Language and Social Studies, the California Test of Mental Maturity, and the
three DAT subtests were combined. Because 97 per cent of the parents were
unskilled laborers, there was little discrimination in socioeconomic status
(SES) and SES did not become part of the regression equation. Boney concluded
that "Negro students are as predictable as other groups" and that "prediction
could be made in junior high school."

Wilson (1969) reported a study undertaken by College Research Center in 
order to facilitate the efforts of a group of eight highly selective liberal 
arts colleges for women to evaluate the progress of black students enrolled 
at the time and to develop rationales for extending educational opportunity to 
members of disadvantaged minority groups. The study focused on (a) selected 
characteristics of black women who entered member colleges of the College 
Research Center in 1965, 1966, and 1967, and (b) the correlational validity 
of standard admissions criteria for predicting college grades. 

Black students entering CRC colleges during the study, themselves a select
group, differed from their classmates in a variety of educationally relevant
ways: in socioeconomic background, career orientations, perceived purposes of
college, educational plans, attitudes, and in level of performance on standard
admissions variables (measures of academic aptitude, SAT Verbal and
Mathematical; scores on College Board Achievement Tests; and secondary school
standing). The findings of the study suggest that, despite such differences,
forecasts of freshman-year academic performance are likely to be at least as
accurate for black students as for their white classmates. There is, moreover,
some evidence that predictions made on the basis of standard formulas may tend
to overestimate the first-year performance of black students in the several
colleges studied.

"It is commonly assumed that scholastic aptitude tests are biased against
culturally different or disadvantaged students . . . but it is important to know
whether they have useful validities for predicting relative criteria for such
students." So wrote Munday (1965), who studied the predictive value of the
American College Testing Program (ACTP) for 1658 students in five 4-year Negro
colleges in four different Southern states. Munday employed five separate criteria
(college English, mathematics, science, social studies, and overall averages).
He found that the multiple R's derived from optimally weighting four high school
grades in each category were lower than the multiple R's derived from the optimal
weighting of the four ACTP tests. The latter R's gave predictions of college
grades that were as good for the Negro colleges as for all colleges using the
ACT service.

Munday described his findings as being consistent with those from other
studies, that is, that grades for socially disadvantaged students are generally
as predictable as grades for other students using standardized measures of academic
ability. In Munday's words: "If such tests are culture-bound, as seems likely,
this feature does not appear to detract from their usefulness as predictors of
academic success."



Mexican American Studies 



In one of a series of studies investigating the possible bias of testing
Spanish-speaking children in English, Davis and Personke (1968) gathered evidence
concerning the effects of administering the Metropolitan Readiness Test (MRT)
in English and Spanish to 88 Spanish-speaking children in their first school
year in a South Texas city. Fifty-three of the children were enrolled in
pre-first grade sections, or "readiness classes" designed for children deficient
in the English language; 35 of the children were in regular first-grade sections.
Early in the school year, the Spanish version of the MRT, with published test
directions in English translated into South Texas colloquial Spanish, was
administered to all of the children by the same individual, and the English
version, according to school practices, by the classroom teachers. Contrasts of
mean differences on subtest and total scores on the two modes of test
administration yielded mostly non-significant differences. The children performed
at a significantly higher level on the subtests on Word Meaning when the test was
administered in Spanish; on the subtests on Alphabet and Numbers, however,
significant differences favored the administration of the test in English. The
findings did not show that administration of the MRT in English rather than
Spanish resulted in any inadequate assessment of and substantial testing bias
against Spanish-speaking children.

As a second phase of the study, Personke and Davis (1969) administered
the Metropolitan Achievement Tests (MAT) in May to the first graders who had
participated in the earlier testing with the MRT. The total score on the
English administration of the MRT was a significantly better predictor of
performance on the Word Knowledge subtest of the MAT than was the total score
on the Spanish administration. For the other two subscores on the MAT, Word
Discrimination and Reading, the English administration of the MRT yielded higher,
but not significantly different, coefficients of correlation than the Spanish
administration did. Of 12 comparisons made between the subtests of the MRT
(English and Spanish versions) and the three scores on the MAT, six differences
were statistically significant, and these differences divided themselves equally
between the English and Spanish administrations. The administration of the MRT
in English rather than in the children's native Spanish apparently did not result
in test bias for these children.

While the results of this research are interesting and impressive, one wonders
how any other outcomes could have been anticipated. If children are being taught
to read English, then their readiness to learn should be best assessed in
terms of their ability to cope with the English language; and the greater that
ability, the greater the amount of progress in reading achievement to be expected.

Karabinus and Hurt (196?) described the results of the revised Van Alstyne
Vocabulary Test given to 535 six-year-old Mexican-American children attending
poverty-qualifying schools in Tucson, Arizona. Spearman-Brown, Kuder-Richardson,
and test-retest reliability coefficients for the scores of the Mexican-American
children ranged from .76 (Kuder-Richardson) to .87 (test-retest), as compared
with .71 (Spearman-Brown) for the general norming population. Concurrent validity
coefficients with the Stanford-Binet Intelligence Scale, the Wechsler Intelligence
Scale for Children, and the Metropolitan Readiness Tests were above .60. While
the Van Alstyne test was judged to be both reliable and valid for the measurement
of mental ability of these Mexican-American children, the mean mental age for the
two groups was so much lower than that of the general norming population (33.4 as
opposed to 44 to 47) that a normalized frequency distribution of raw scores showing
corresponding percentile ranks was developed for use with the Mexican-American
children rather than the percentile ranks for IQ scores provided in the manual.
It was suggested that the special norms might be useful when measuring other
culturally disadvantaged children.
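
For readers unfamiliar with the Spearman-Brown coefficient named above, a
minimal sketch of the standard formula follows. The half-test correlation of
.60 is a hypothetical input chosen for illustration; Karabinus and Hurt's
underlying half-test values are not given in this review.

```python
def spearman_brown(r: float, length_factor: float = 2.0) -> float:
    """Projected reliability when a test is lengthened by length_factor.

    With the default factor of 2, r is the correlation between two halves
    of a test and the result is the estimated full-length reliability.
    """
    return (length_factor * r) / (1.0 + (length_factor - 1.0) * r)


# Hypothetical correlation between odd-item and even-item half scores.
half_test_r = 0.60
full_test_reliability = spearman_brown(half_test_r)
print(f"estimated full-test reliability: {full_test_reliability:.2f}")
```

Because the formula steps a half-test correlation up to full length, a
split-half coefficient is not directly comparable to a test-retest coefficient,
which also reflects stability over time.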

Morper (1967) studied the relationship between certain predictive variables
and achievement measures for Spanish-American and Anglo ninth graders in Oklahoma.
To 50 children of each ethnic group he administered the Wechsler Intelligence
Scale for Children (WISC), the Lorge-Thorndike Intelligence Tests, and the
School and College Ability Test (SCAT) as predictive measures. Achievement
measures included teacher marks in English, mathematics, and science and the
Metropolitan Achievement Tests (MAT).

For the Spanish-American group, neither the WISC nor the Lorge-Thorndike
IQ's correlated at the 5 per cent level of significance with scores on the MAT;
while for the Anglo group, all three predictor variables correlated satisfactorily
with the MAT scores. With teacher marks as criterion variables, the correlations
for all predictive variables were significant for both ethnic groups. The greatest
differences between the Spanish-American and Anglo groups were observed when reading
ability and comprehension were most involved in the obtaining of a measurement, the
difference being in favor of the Anglo group.

Kimball (1968) studied parent and family influences on the academic achievement
of Mexican-American students. His population included 1457 Grade 9 students from
eight junior high schools, 899 Mexican Americans and 558 Anglos. Twenty-three
variables were tested for association with (1) school marks, (2) achievement test
scores, and (3) general ability. Parental educational aspirations for their child
were significantly related to all achievement variables and were more strongly
related to achievement than were personal identity, background, family structure,
social status, and ethnic status. Just below parent influence in predictive
ability were per cent of Anglos in the school, socioeconomic status, father's
education, family intactness, family birth in Mexico, grandparents' residence,
and birthplace of child. Sex, age, birth order in family, and family size were
of little consequence.

A comparison of Mexican-American and Anglo patterns of relationship between
achievement and these independent variables was found by Kimball to indicate more
overall differences than similarities.

Chandler and Plakos (1969) of the Mexican-American Education Project conducted
an investigation to determine whether certain Mexican-American students belonged
in Educable Mentally Retarded (EMR) classes or whether a language barrier prevented
them from being assessed properly as to their native abilities to perform cognitive
tasks. Their sample included 47 students of Mexican descent, with a problem in
using the English language, in grades 3 through 8 in two school districts, an urban
and a rural district, in different geographical areas.

The Spanish version of the Wechsler Intelligence Scale for Children (WISC) was
administered and scores interpreted in terms of norms developed in Puerto Rico.
(Because this version was in Puerto-Rican Spanish, some items had to be reworded
and some changes made in the key.) The IQ's so obtained were compared with previous
IQ's based on a test not identified. The mean IQ gain was 12.4, with 44 of the 47
students scoring higher on the Spanish WISC. The median IQ was 83, as compared
with a median IQ of 70 on the test administered earlier. Only 9 of the 47 scores
were below the cutoff IQ of 75 for EMR classes when the Spanish WISC was given.

Of interest to note here is an experiment conducted by Palomares and Johnson
(1966) that demonstrated the crucial role played by the psychologist in the
overrepresentation of Mexican-American children, or, for that matter, the
overrepresentation of children of any minority group, in EMR classes. Palomares
and Johnson each tested and interviewed approximately 35 Mexican-American
children, ages 7 to 14 years, who had been recommended for EMR class placement.
After testing the children with the Wechsler Intelligence Scale for Children
(WISC), the non-Spanish-speaking psychologist, Johnson, found 24 of his 33
students, or 73 per cent, eligible for EMR classes, while the Spanish-speaking
psychologist, Palomares, recommended that only nine of his 35 students, or 26 per
cent, be placed in EMR classes. Clearly examiners, as well as tests, can differ
even when the students tested are similar and the test used, the same. There is
little doubt but that a larger scale experiment would result in similar findings.
Incidentally, both examiners averaged IQ estimates of 95 on the Goodenough-Harris
Draw-a-Man and Draw-a-Woman Test for children on subsamples of 25 for whom the
WISC total IQ's averaged 70 and 75, respectively.
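
The percentages reported for the two examiners pin down the underlying counts,
which can be checked with a short sketch. The helper name below is hypothetical,
and the check assumes the percentages were rounded to the nearest whole number.

```python
def counts_matching(total, reported_percent):
    """Return every count k for which k/total rounds to reported_percent."""
    return [k for k in range(total + 1)
            if round(100 * k / total) == reported_percent]


# Johnson: 73 per cent of 33 students judged eligible for EMR classes.
johnson = counts_matching(33, 73)
# Palomares: 26 per cent of 35 students recommended for EMR classes.
palomares = counts_matching(35, 26)
print("Johnson:", johnson, "of 33")
print("Palomares:", palomares, "of 35")
```

Only one count is consistent with each reported percentage, so the contrast
between the examiners amounts to 24 children against 9 in samples of nearly
equal size.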

Metfessel (1965) studied attitude and creativity factors related to achieving
and nonachieving disadvantaged youth, largely Mexican-American. He found that
Individual Tests of Creativity are considerably superior to traditional measures
of intellect and scholastic aptitude in predicting academic behavior generally,
and that of Mexican Americans particularly. Correlations of the scores on these
creativity tests with grade point averages were ranging from .39 to .49 at the
time Metfessel reported. The Inventory of Self Appraisal and the Meaning of Words
Inventory, two relatively independent tests of the achievement motive, were
correlating between .36 and .44 with grade point average. Metfessel concluded
that the results appeared to indicate that "the above three tests combine to
produce a potent unified approach to forecast student achievements."

The eight Mexican-American studies briefly annotated in this section cover
thinly the same general issues treated more fully for blacks and whites of low
socioeconomic status in the preceding sections. The added feature is the foreign
language component; ghetto children suffer language handicaps, but nothing quite
as "wrong" as a wholly different language base. The Palomares-Johnson difference
of interpretation of essentially the same low performance on individual tests is
an echo of the Kariger (1962) finding reported in the previous document that
personal judgment compounds the ethnic separation produced by objective measurement.



Misuses of Tests 



Generally speaking, researchers are not studying or trying out and evaluating 
tests. They are studying other matters — problems, gains for compensatory programs, 
and the like. For the most part the tests are taken for granted as measuring instru- 
ments; in only a few cases are they questioned. That is undoubtedly why there 
are very few investigations of how well a test works — how valid it is — with specific 
differentiated groups. The published nationally standardized test is often accepted
uncritically and/or simply used as the best available instrument for the purpose
at hand.

Beyond the general acceptance of the test as "it," the search of the literature
has uncovered some rather serious misuses of tests: using certain tests inappropriately,
making comparisons across different tests, and reading into the test results more
than the author and publisher intended. The Peabody Picture Vocabulary Test has been
particularly misused. This easy-to-give test seems to be widely accepted as a good
measure of general intelligence rather than offering an estimate (only) of verbal 
intelligence. It is frequently used with culturally deprived children with very 
limited vocabularies and the results compared with those of the norms group. Its 
use as a screening device is justified — nothing more. 

Among other instances of misuse are these, which were written down as noted in 
reading the many studies abstracted for this report. The presence of a few such 
studies in this report is noted incidentally.

Assuming that a test designed for gifted children of one age is
suitable, then, for use with older children with limited backgrounds.
(See Hunter Aptitude Scales study, p. 21)

More generally, assuming that a test constructed and standardized
for children of a given age and/or school experience is equally
valid for children of different ages and/or experience.

Changing some items and some credited answers, but applying the
regular norms, especially with Puerto Rican and Mexican-American
groups. (Noted in studies in preceding section)

Testing so early in preschool programs, in order to get a pretest
base when improvement is to be measured, that test results cannot
be valid. When a child has never handled pencil or crayon, never
had a book or booklet and turned pages, never followed group
directions, never worked steadily in a self-directed situation, then
a group test like the Metropolitan Readiness Test cannot be a valid
measure. It does not measure what the test is designed to measure
because test-taking is so new and unfamiliar. The resulting scores
may be purely chance, or zero, although the children may have some
degree of readiness.

Posttests after an interval of group experience and use of crayons,
and so forth, can produce a more valid result. But to measure score
gains from pre- to posttesting and ascribe them to the effectiveness
of the program in bringing about improvement in the traits measured
is not justifiable if no training for the pretesting has been given.
(Several Headstart evaluations suffer from this flaw.)

Assuming that learning ability is measured by what has been learned,
using the Peabody Picture Vocabulary Test or even the Stanford-Binet,
with its heavy emphasis on vocabulary, or the Wechsler Intelligence
Scale for Children, with children with limited backgrounds. The
emphasis on evaluation in these early childhood programs should be
on getting children ready to be taught. The emphasis should be on
current achievement, rather than on "intelligence," in assigning
them to learning groups.

Failing to separate reading and oral vocabulary in English from the
appraisal of learning ability. Failure to use other than English-
language tests for Mexican-American children, and then classifying
low scoring pupils as mentally retarded, is a clear example. (Noted
in preceding section)

Doing studies with very small numbers of students. In some studies,
no tests of significance have been made and, if they had been, hardly
any significant (meaningful) results could have been obtained because
of the tremendous differences in score that would have been required.
Many findings of "no significant difference" are attributable to the
small numbers of cases involved.

Failing to follow through for two, three, four years, or more. The
lack of longitudinal studies is distressing. It is little wonder
that the longitudinal study of the culturally deprived in compensatory
programs, being conducted under the auspices of Educational Testing
Service for the U. S. Office of Education, from age three to grade 3,
has been so widely hailed. There are no others like it.

Interpreting scores of individuals on short subtests when the reliability
estimates, simply because of the length of the tests, make it impossible
to trust the results of comparisons. Comparison of means for groups
on the same data would be quite permissible because group means are
often reliable enough for such purposes.

Comparing reliability coefficients without reference to differences
in range of scores.

Treating different measures of learning ability as though the results
on them were comparable. Often, no attention is paid to what the test
is measuring, i.e., to its content. Thus, the Goodenough-Harris Draw-
a-Man and the Peabody Picture Vocabulary Test are often treated along
with the Stanford-Binet as though they were equivalent and similar measures.
Results on group pencil-and-paper tests of mental ability cannot be
treated as equivalent to the results from individual testing.

Attaching the same importance to predictive validity without intervention
(in the form of compensatory training) as with it. When a minimum amount
of intervention is used, predictive validity is an indicator of the
usefulness of preliminary information; when substantial intervention
is attempted, predictive validity is no longer subject to such simple
interpretations. Successful intervention involves defeating predictions
of failure.
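
Two of the cautions listed above, the one on short subtests and the one on
comparing reliability coefficients across groups with different score ranges,
can be illustrated numerically. The sketch below is not drawn from the studies
reviewed. It assumes the standard Spearman-Brown formula for a change in test
length and the usual assumption that the standard error of measurement is the
same in both groups; the function names and the figures are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Reliability of a test lengthened or shortened by length_factor.

    Assumes the Spearman-Brown formula; length_factor = 0.25 means a
    subtest one quarter as long as the full test.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def reliability_in_other_range(reliability, sd_original, sd_other):
    """Reliability expected in a group with a different score spread.

    Assumes the standard error of measurement is equal in both groups,
    so only the observed-score standard deviation changes.
    """
    return 1 - (sd_original / sd_other) ** 2 * (1 - reliability)


# Hypothetical figures: a full test with reliability .90 in a group
# whose scores have a standard deviation of 15.
short_subtest = spearman_brown(0.90, 0.25)
narrow_group = reliability_in_other_range(0.90, 15, 10)
print(round(short_subtest, 2), round(narrow_group, 2))
```

Under these assumptions a subtest one quarter the length of the full test would
have a reliability of about .69, and the same full test given to a group with a
standard deviation of 10 rather than 15 would show a reliability of about .78.
Neither drop reflects any change in the quality of the items.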



Just as much of the research on ability grouping has failed to produce con- 
clusive findings regarding the advantages (and the disadvantages) of such grouping, 
in like manner much of the research on the testing of the culturally limited has 
failed to produce conclusive findings regarding either the validity of the tests 
for the use being made of them or the validity of the interpretations of the 
test results for such students. 



As long ago as 1964, Fishman et al. prepared a set of "Guidelines for Testing
Minority Group Children." The reader may be referred to that source for a compact 
summary of the major issues. 

The discussion in this document has taken particular account of their first 
two major points regarding the importance of any differences found in reliability 
and predictive validity when the same instruments are used to evaluate minority 
and majority group children. Notice has been taken at several junctures that 
(1) reliability of a test is often equally great for minority as for majority 
groups, and (2) predictive validity is often as high for minority or mixed groups 
as for majority groups. In fact, instances have been reported in which predictive 
equations based on majority groups overpredict the subsequent academic achievement 
of minority students, thereby "favoring" the minority groups at choice points such 
as college admission or ability group assignment. 



The discussion proceeds farther, however, to consideration of factors that 
affect both measures taken at the initial point of prediction and the later "final"
point of assessing achievement. It is here that doubt and confusion remain. Equally
low effort and accomplishment at both points will contribute positively to predictive
validity. Does this lack of effort on tests at both points, a failure to
organize oneself for the ultimate in competitive effort, constitute a fundamental 
defect requiring remediation? Does modern life essentially require this competitive 
effort? If so, can it be learned? Meanwhile, what procedures can be adopted to 
keep these modifiable traits from unduly influencing initial measures? Can we turn 
to foreign students for a cue? Must we allow practically unlimited time for initially 
slow-paced children so they can take their time interpreting questions, reading and 
"translating" multiple-choice options, carrying through problem-solving operations? 



Also, can we accept as a crucial goal of modern education the separation of
essential objectives basic to success in school learning and later in employment
from what have been considered marks of the educated person? If so, we may be able
to foster affective development of minority children and thereby indirectly their
cognitive development.







B. ALTERNATIVE STRATEGIES 



The research into the procedures for the use of tests in grouping students
for learning has provided limited information. This research has been described
in earlier sections of this report as generally inconclusive, with
the learning environment uncontrolled and the affective domain de-emphasized.
There is real need for a well designed major program of longitudinal studies,
including multi-variate and covariate analyses with consideration of the learning
environment, in which the student's development is evaluated against criteria
involving the cognitive, performance, and affective domains (Anderson, 1969).
However, during the years required for such studies, certain helpful practices
for the use of tests in the learning situation have been identified and can
be described. The practices are concurred in by authorities from the fields
of education and psychometrics.



Individualized Instruction 



The purpose frequently stated for grouping children in learning situations
is to provide for individual differences. In this subsection, selected
procedures are discussed for test utilization and the realization of individualized
instruction.

Perhaps individualized instruction has as many definitions as there are
"authorities" defining the term. Individualized instruction is herein thought
of as a process of designing the curriculum for the individual (Goodlad, 1966;
Rasmussen, 1968). In the process we would start by developing rapport with
the student. As rapport is established the teacher initiates an effort to 
define the student's characteristics. If not initially, as soon as feasible, 
tests and measures should be utilized by a competent person to assist in the 
definition of the student's characteristics. As the student enters school, 
for example, the tests might well include individual intelligence tests 
and/or reading readiness measures. 

After the teacher has established rapport with and gained a knowledge
of the student, she is in a position to discuss objectives with the student.
The objectives are mutually agreed upon and become those of the student.

The curriculum content is selected by the teacher to support the student's 
objectives. The content includes relevant and realistic aspects of the 
cognitive, performance, and affective domains. 

The student progresses at his rate in the mastery of the identified
curricular content. It is emphasized that the student progresses at his
rate to mastery. The mastery is normally determined in part, if not totally,
by tests. The tests measure achievement and performance, and sample curricular
content behaviors. The purpose of the testing is to establish mastery and
readiness for the next curricular topic. In the event that the student
has not mastered a given topic, he is not failed but continues to study the
topic until mastery is obtained.

The procedures, materials, and methods used to guide the student in
learning the content are individualized for the student (Glaser, 196 i; Lindvall
and Cox, 1969). In that the measures of cognitive processes and styles are
in preliminary stages of development, they are not currently dependable for
this purpose. Rather, the teacher should observe, both informally and
systematically, the means whereby the student learns, and proceed to guide the
student on a pragmatic basis.

Now that we have individualized instruction, is it possible to group
students for learning? Four possible procedures are suggested. They are 
not exhaustive of all possible procedures. They are judged, in the light 
of the findings of the preceding sections, to be the most promising. 

Heterogeneous Grouping

Heterogeneous grouping involves the bringing together of students who 
deviate extensively on a given variable. For example, in an elementary 
school social science class a topic for discussion might be the State of 
California. The student's knowledge of the state is the variable. Some
students might have lived in or visited the state and observed a great
amount of realistic information pertaining to the state. A group is
formed consisting of those knowledgeable students and those desiring to 
learn about the state. In this instance we have an "ad hoc" heterogeneous 
group. The knowledgeable members have an opportunity to gain in leadership 
and communication skills through instruction of the others. The others, 
with guidance, are motivated to learn that which their peers know. 

Heterogeneous grouping of this nature is practiced in the non-graded 
school. Children assigned in a non-graded school vary considerably in 
age, experience, and knowledge. The heterogeneity is planned so that the 
children can learn from each other. 



Stratified Heterogeneous Grouping 

The illustration just cited presents a clear case for the values of
heterogeneous grouping. But let us consider another situation commonly
faced in elementary schools in which it has been customary to teach classes
of 30 children or so in self-contained classrooms where the 30 children
stay with the same teacher in the same room for practically the entire
day. Suppose we accept the criticism of those who argue for homogeneous
ability grouping to reduce the span of achievement in each classroom, yet
are even more attentive to the criticism of those who argue against
homogeneous grouping of whole classrooms because of the stigma this places on
those in the average and low groups while giving the high groups an
unwholesome feeling of general superiority. Can these views both be accepted in a
plan of organization of classrooms that has its own peculiar advantages? It
has been done.

In Baltimore, a fundamental plan of organization recommended as an
alternative that meets these requirements* may be called a plan of "stratified
grouping." Under this plan, if three classes of 30 are to be made of 90
children ready to start fifth grade, the children would be ranked in order
of excellence on some composite (say, a standardized test battery most
recently given) and then be subdivided into nine groups of ten each. Teacher A
would be given a class consisting of the highest or first ten, the fourth
ten, and the seventh ten; Teacher B would have the second, fifth and eighth
tens; Teacher C would then be given the third, the sixth and the ninth (lowest)
tens.
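
The assignment rule just described can be sketched in a few lines of code. This
is only an illustration of the arithmetic, not a procedure taken from the
Baltimore guide: the function name is hypothetical, pupils are represented by
their rank on the composite (1 is highest), and ties and unequal class sizes
are ignored.

```python
def stratified_classes(ranked_pupils, n_classes=3, block_size=10):
    """Deal consecutive blocks of ranked pupils to classes in rotation.

    ranked_pupils must already be sorted from highest to lowest on the
    composite. Block 1 goes to the first class, block 2 to the second,
    and so on, returning to the first class after the last.
    """
    blocks = [ranked_pupils[i:i + block_size]
              for i in range(0, len(ranked_pupils), block_size)]
    classes = [[] for _ in range(n_classes)]
    for index, block in enumerate(blocks):
        classes[index % n_classes].extend(block)
    return classes


# Hypothetical data: 90 pupils identified only by rank.
ranks = list(range(1, 91))
class_a, class_b, class_c = stratified_classes(ranks)

for name, members in (("A", class_a), ("B", class_b), ("C", class_c)):
    print(name, len(members), min(members), max(members))
```

With 90 ranked pupils, Teacher A's class runs from rank 1 to rank 70, Teacher B's
from 11 to 80, and Teacher C's from 21 to 90, so each class spans 70 of the 90
ranks and no class holds both extremes.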



Note the several merits of this scheme. First, there is no top or bottom
section; the sections overlap, so invidious comparisons between groups are
minimized. Second, each class has a narrower range than the full 90 have:
Teacher A has the top ten, but none of the bottom 20; Teacher C has the
bottom ten, but none of the top 20; Teacher B has neither the top nor the
bottom ten. Third, teachers can give special attention where it is needed
without feeling unable to meet the needs of the opposite extreme: Teacher A
can give a little special attention to the top ten because the bottom 20
are not in the class; Teacher C can concentrate on the bottom ten, without
fear of "losing" the top 20. Fourth, each class has leaders of appropriate
capability to stimulate each other in a fair competitive way while giving
leadership to lower groups; note particularly that in Teacher C's class,
the top group is the third ten, a group that has probably always had to
play second fiddle to some in the first or second ten. Finally, no
teacher has to teach the bottom group of a homogeneous plan, that mixture
of disruptive, leaderless children that lack motivation and capability
and make teachers like homogeneous grouping, but equally dislike to teach
the slow group.

Such a method of grouping is not offered as a complete answer by itself,
but as a constructive step in the right direction. It is, moreover,
compatible with other special teaching arrangements like team teaching, peer
tutoring, and early education.

The history of heterogeneous grouping schemes is that they do not involve 
an additional expenditure of funds. Our third procedure is thought to 
involve additional funds, especially during the implementation phase. How- 
ever, the additional gains in this third procedure are judged to show a 
favorable cost-effectiveness trade-off.



Team Teaching 

The U. S. Office of Education has sponsored a number of efforts to develop
specifications for new model elementary school systems. A total of ten (10)
such models have been developed (Stauffer and Deal, 1969). Without exception
each model, with numerous variations, has embraced the concepts of individualized
instruction, mastery, and differentiated staff. The differentiated staff
approach specifies various personnel categories for teachers such as aides,
assistants, specialists, and the like (Allen, 1967). Each category has
certain functions of prime responsibility. The team teaching staff is selected
from these categories of teachers so as to satisfy the requirements of a given
situation.

* Elementary School Guide, Baltimore Public Schools, revised edition, 1967.

The team would normally contain or have readily available a specialist 
who would perform, or guide a competent teacher in, the diagnosis of the 
individual student. The specialist is trained in selecting and administering 
tests, interpreting test results, and defining appropriate programs of
instruction. After the objectives and content are defined for the student, the task
of guiding the student's learning is assigned among the team members as
appropriate.

In a team, normally, there is a considerable number of staff members, 
say six or more, and a large class, say 100 or more. Thus, it is frequently 
found that a number of students have a need to learn the same tasks. Groups 
of such students are formed and assigned to a designated teacher for the pur- 
pose of learning the specific tasks. The grouping is informal, ad hoc, and 
of short duration. In a situation of this nature the students and teachers 
are paired with the task to be accomplished. Grouping in this manner promotes
the effective utilization of personnel and resources, and increased learning 
by the individual, without the identified detrimental effect of homogeneous 
grouping. 



Student Tutoring 



Tutoring of children deficient in academic skills by older children has 
been widely adopted within compensatory education programs. Not surprisingly, 
those tutored show more than normal gains over a period of instruction. What
is perhaps somewhat more surprising, when older children, themselves deficient
in basic skills, are paid to tutor younger children who are deficient, the
gains of the tutors outstrip by far the gains of the tutored!

Cloward (1967) reports a study in which children of junior high grade
status who were two or more years retarded in reading, as measured by grade
scores on a standardized reading test, were paid $1.25 per hour to tutor
deficient fourth-grade children of similar ethnic background (Caucasian,
Puerto Rican, Negro). The program was conducted over an academic year after
the tutors had undertaken a period of preparation (also on paid time) for their
teaching chores. The psychodynamics of the tutor growth is worth spelling
out rather fully.

First, these older students who had experienced the constant role of 
failures pitied or deplored by their teachers were now being asked, nay 
even paid, to make a contribution to others. Second, in preparing for this 
work they had learned the basis of the old maxim "If you want to learn
something, teach it." Third, they could see their pupils learn, as measured 
by daily response as well as by terminal test. 

Specifically, using analyses of covariance to control for small initial
differences in reading scores, Cloward found that 100 deficient readers in
fourth and fifth grade who were tutored for four hours a week for 26 weeks did
reliably better than 79 control children at the end of that period, reversing
somewhat the normal trend toward further retardation characteristic of their
peers. Tests given five months apart showed average gains of 6 months by
experimentals; 3½ months by controls. During the same period 77 tutors, who
averaged 0.8 grades deficient at the start, gained reliably more than their
32 controls by 1.7 grades. Bearing in mind that grade score differences at
high school level are magnified by the fact that the slope of the growth
curve is decreasing, the adjusted mean difference at the end is slightly more
than half a standard deviation on the score scale.



Early Childhood Education 



At least since the 1930's, when the studies emanating from the Iowa
Child Welfare Research Station (Stoddard, 1943) challenged the then
accepted concept of the constancy of the IQ (Hunt, 1961) with evidence that
substantial gains or losses in intellectual competence could be generated
by the nature of early environmental stimulation of children, many parents
from the upper socioeconomic classes have been sending their children
to nursery schools. Beginning sometimes as early as age 2, these children
have enjoyed intellectual stimulation in a supportive emotional climate and
have emerged readier to participate in conventional schooling at age 5 or 6.
In many such schools, priority has been given to affective development over 
intellectual stimulation. In others, however, intellectual stimulation 
has been an integral feature of this early education. 

Currently, the debate rages about whether this early intellectual 
stimulation may be cast in a form that is best called early schooling, the 
earlier presentation of instructional stimulation ordinarily offered all 
comers at an approximately uniform starting point of age 6 in grade 1.

What is best done at earlier ages is still moot, but experiments with
children beginning at age 5 in kindergarten (McKee and Brzeinski, 1966;
Brzeinski et al., 1967; Fortson, 1969) show conclusively effective gains from
planned early schooling in kindergarten. The Denver data reported by
Brzeinski show that reliable gains from such early instruction in reading
persist at least through grade 5, with some spread to related curriculum
areas. An important condition is that gains achieved in kindergarten
shall be consciously built upon in successive grades rather than being left
to conventional programs for incidental forwarding; indeed, children placed
in conventional classes with children beginning the learning of reading
at age 6 in grade 1 soon slip from being recognized by their teachers as
advanced at that point to becoming ones less challenged by the teaching of
already learned skills and eventually being not at all advanced over their
peers.

Implications of these and other findings for the enhancement of
learning by disadvantaged groups would appear to be that the practice
of beginning formal instruction at age 5 (with some imaginative adaptations)
might well follow the established practice of the British infant school of
beginning instruction for all children at this level.



A Note on Jensen and Other New Developments



Because of the widespread publicity achieved by the debate over an 
article entitled "How Much Can We Boost the IQ and Scholastic Achievement?" 
by Arthur R. Jensen in the Winter 1969 issue of the Harvard Educational 
Review, some readers may wonder at its relevance to the issue of ability
grouping. Jensen suggests that some children learn better by associ- 
ation (rote memory), others by fitting new learning into a conceptual 
framework by higher mental processes, and that the whole matter of 
efficient learning styles is related to genetically determined "intelli- 
gence" in which certain ethnic groups are on the average considerably 
better endowed than others. 

The reader is referred to the considerable bibliography of critical 
replies in subsequent issues of the Harvard Educational Review and elsewhere, 
listed at the end of this document. Suffice it here to quote from Cronbach's 
response and add our abbreviated critique.

Cronbach (1969) says in part: 

Professor Jensen is among the most capable of today's 
educational psychologists. His research is energetic 
and imaginative. In the present paper, an impressive 
example of his thoroughness, I am sure every reader 
has had my experience of encountering valuable
information in areas where he thought himself au courant.

Unfortunately, Dr. Jensen has girded himself for a
holy war against "environmentalists" and his zeal 
leads him into over-statements and misstatements. 

Despite the merits of Jensen's research remarked by Cronbach, and 
admitting the dubious propriety of some of the criticism addressed to Jensen 
for publishing data and argument that may be used for partisan ends, his 
presentation suffers from faults in at least five major respects: 

1) He starts in journalistic style to proclaim a finding, rather 
than in professional style to build a convincing case. 

2) Current brief and fragmented efforts at compensatory education
show little effect, but it is too much to say compensatory education
has failed. Efforts expended on short-term early education have
produced modest gains in some instances; other experiments here and in
other countries have succeeded (Brzeinski, 1967; Bloom, 1969). One
might fairly add that no major effort comparable to the systematic
discrimination of over three centuries against American blacks has
even been attempted.

3) Traits with high heritability are often modifiable (Goldstein,
1969).

4) Education's business is with a substantial modifiability.
Even a correlation of .87 between monozygotic twins leaves 25% of
the variance unaccounted for (Bloom, 1969).

5) He closes on a note that suggests the likelihood of his
model of distinctive learning styles for variously different children
without clear evidence of the likely effectiveness of different
teaching styles for classroom groups. Since disadvantage according to
Jensen is an individual characteristic compounded of individual
and group hereditary and environmental factors and their interactions,
this can only imply responsiveness of teachers to all children with
a variety of teaching styles rather than heavy dependence on one
teaching style for children of each of the different learning styles.
His discussion, moreover, leaves entirely out of consideration the
teaching and learning that go on between children.
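
The 25% figure in point 4 appears to follow from the coefficient of
determination, the square of the correlation. That reading is an assumption
made here, since the source does not show the arithmetic, and the function
name below is hypothetical.

```python
def variance_unaccounted_for(correlation):
    """Share of variance left unexplained, taken as 1 minus r squared."""
    return 1 - correlation ** 2


# The twin correlation quoted in point 4.
share = variance_unaccounted_for(0.87)
print(f"{share:.1%}")
```

On this reading a correlation of .87 accounts for about 76 per cent of the
variance and leaves about 24 per cent, which the text rounds to 25%.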



Other new proposals, like performance contracts and vouchering of funds
to parents to let them "buy" their children's education from the best sources,
are merely noted here. They are procedural rather than instructional variations.
If used, it would remain for instruction to be designed as suggested here, or
by more ingenious instructional plans; performance contracts and vouchering
merely establish different contractual arrangements for authorizing instructional
activity.



SUMMARY AND CONCLUDING REMARKS 



After pointing out some of the pitfalls in the interpretation of tests used
for grouping children with limited backgrounds and some of the efforts being made
to provide better interpretative data, this document has been closed with a series
of brief accounts of alternative strategies to ability grouping. These illustrations
by no means exhaust the possibilities, but they constitute a set of mutually
compatible strategies each of which has separate merit. Heterogeneous grouping promotes
communication and peer teaching. Stratified heterogeneous grouping furthers these
same goals while reducing the extreme variations in a class that complicate group
instruction. Team teaching permits flexible grouping to achieve individual learning
objectives. Student tutoring promotes learning by the tutors as well as by the tutored,
a circumstance also furthered by stratified grouping. Early childhood education,
at least from kindergarten at age 5, can undergird a persistent gain in mastery of
fundamentals. Taken together, these alternative strategies constitute a constructive
challenge to the unrealized advantages and actual deleterious effects of ability
grouping in the areas of scholastic achievement, affective development, and the
ethnic and socioeconomic separation (isolation, deprivation) of children.